HDA Exam

Question 1: Descriptive Table

Build a descriptive table, comparing patients with the event death (within 10 years) versus patients without the event in that time frame. Insert also a column with the total population descriptive statistics. Comment about percentages of missing data in the candidate predictors.

1.1 Format variables

#explore the type of variables
str(survey)
## 'data.frame':    6989 obs. of  21 variables:
##  $ Age               : int  39 65 74 57 74 44 39 70 42 72 ...
##  $ Diastolic.BP      : int  90 85 110 82 80 95 90 95 75 80 ...
##  $ Poverty.index     : int  52 137 89 129 63 99 258 54 258 184 ...
##  $ Race              : int  2 2 2 2 2 2 2 2 1 1 ...
##  $ Red.blood.cells   : num  4.91 4.08 4.46 4.33 4.73 3.81 4.69 4.93 4.11 5.01 ...
##  $ Sedimentation.rate: int  25 20 46 22 8 50 4 32 18 13 ...
##  $ Serum.Albumin     : num  4.9 5.1 4.3 4.4 4.4 4.6 4.4 4.2 5 4.5 ...
##  $ Serum.Cholesterol : num  243 254 337 249 141 ...
##  $ Serum.Iron        : int  84 84 102 95 131 86 102 144 111 161 ...
##  $ Serum.Magnesium   : num  1.62 1.74 1.91 1.55 1.81 1.8 1.8 1.61 1.69 1.78 ...
##  $ Serum.Protein     : num  8.8 7.8 7.5 7.7 7.5 7.7 7 8.3 6.8 6.6 ...
##  $ Sex               : int  2 2 2 2 1 2 2 1 2 1 ...
##  $ Systolic.BP       : int  135 180 190 150 NA 150 120 NA 115 NA ...
##  $ TIBC              : int  423 363 287 362 315 382 401 352 468 367 ...
##  $ TS                : num  19.9 23.1 35.5 26.2 41.6 22.5 25.4 40.9 23.7 43.9 ...
##  $ White.blood.cells : num  7.8 8.4 8.1 6.1 3.8 6.5 6.1 5.2 6 8.4 ...
##  $ BMI               : num  42 28.6 23 33.4 27.3 ...
##  $ Pulse.pressure    : int  45 95 80 68 62 55 30 105 40 70 ...
##  $ time              : num  20.9 21.1 12.8 20 20.9 ...
##  $ status            : int  0 0 1 1 0 0 0 1 0 1 ...
##  $ death             : int  0 0 0 0 0 0 0 1 0 0 ...

Some variables coded as int are actually categorical, so I recode them as factors.

#rename time variable to avoid conflicts
names(survey)[names(survey) == "time"] <- "time_death"

# recode categoric variable as factor
categoric_vars<-c('Race','Sex','status','death')
survey[,categoric_vars]<-lapply(survey[,categoric_vars], as.factor)
numeric_vars<- survey %>% dplyr::select(-(all_of(categoric_vars))) %>% colnames()

# assign labels to make comparisons easier
survey$Race = factor(survey$Race,
                     labels = c("White",
                                "afro-american",
                                "other"))
survey$Sex = factor(survey$Sex, labels = c("male", "female"))
str(survey)
## 'data.frame':    6989 obs. of  21 variables:
##  $ Age               : int  39 65 74 57 74 44 39 70 42 72 ...
##  $ Diastolic.BP      : int  90 85 110 82 80 95 90 95 75 80 ...
##  $ Poverty.index     : int  52 137 89 129 63 99 258 54 258 184 ...
##  $ Race              : Factor w/ 3 levels "White","afro-american",..: 2 2 2 2 2 2 2 2 1 1 ...
##  $ Red.blood.cells   : num  4.91 4.08 4.46 4.33 4.73 3.81 4.69 4.93 4.11 5.01 ...
##  $ Sedimentation.rate: int  25 20 46 22 8 50 4 32 18 13 ...
##  $ Serum.Albumin     : num  4.9 5.1 4.3 4.4 4.4 4.6 4.4 4.2 5 4.5 ...
##  $ Serum.Cholesterol : num  243 254 337 249 141 ...
##  $ Serum.Iron        : int  84 84 102 95 131 86 102 144 111 161 ...
##  $ Serum.Magnesium   : num  1.62 1.74 1.91 1.55 1.81 1.8 1.8 1.61 1.69 1.78 ...
##  $ Serum.Protein     : num  8.8 7.8 7.5 7.7 7.5 7.7 7 8.3 6.8 6.6 ...
##  $ Sex               : Factor w/ 2 levels "male","female": 2 2 2 2 1 2 2 1 2 1 ...
##  $ Systolic.BP       : int  135 180 190 150 NA 150 120 NA 115 NA ...
##  $ TIBC              : int  423 363 287 362 315 382 401 352 468 367 ...
##  $ TS                : num  19.9 23.1 35.5 26.2 41.6 22.5 25.4 40.9 23.7 43.9 ...
##  $ White.blood.cells : num  7.8 8.4 8.1 6.1 3.8 6.5 6.1 5.2 6 8.4 ...
##  $ BMI               : num  42 28.6 23 33.4 27.3 ...
##  $ Pulse.pressure    : int  45 95 80 68 62 55 30 105 40 70 ...
##  $ time_death        : num  20.9 21.1 12.8 20 20.9 ...
##  $ status            : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 2 ...
##  $ death             : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

1.2 Visualization

Visualizing the data can sometimes help us understand it better.

Categorical variables

# categorical variables
# note: fill is mapped to the Race variable so each level gets its own colour
barR <- ggplot(data = survey, aes(x = Race, fill = Race)) + 
  geom_bar(width = 0.6) + xlab("Race") +
  ggtitle("Race") +  
  scale_fill_manual(values = c("Orange", "#bb44f0", "#45ADA8")) + theme_classic()

barS <- ggplot(data = survey, aes( x =status, fill = "status")) + 
  geom_bar(width = 0.6) + 
  xlab("Status") +ylab("Count")+
  ggtitle("Status") + 
  scale_fill_manual(values=c("#bb44f0"))+theme_classic()
#stacked.barS

barSe <- ggplot(data = survey, aes( x =Sex, fill = "Sex")) + 
  geom_bar(width = 0.6) + 
  xlab("Sex") +ylab("Count")+
  ggtitle("Sex") +
  scale_fill_manual(values=c("#45ADA8")) + theme_classic()
#stacked.barSe

tot.p <- grid.arrange(barR,barS,barSe, ncol = 3) 

Let’s see how the data are distributed across the levels of the variable death.

windowsFonts(A = windowsFont("Arvo"))
# categorical variables
# consider the relative frequency to have a better understanding

stacked.barR <- ggplot(data = survey, aes( x =Race, fill = death)) + 
  geom_bar(position = "fill") + 
  xlab("Race") +ylab("Relative Frequency")+
  ggtitle("Deaths per Race") +  
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
#stacked.barR

stacked.barS <- ggplot(data = survey, aes( x =status, fill = death)) + 
  geom_bar(position = "fill") + 
  xlab("Status") +ylab("Relative Frequency")+
  ggtitle("Deaths per status") + 
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
#stacked.barS

stacked.barSe <- ggplot(data = survey, aes( x =Sex, fill = death)) + 
  geom_bar(position = "fill") + 
  xlab("Sex") +ylab("Relative Frequency")+
  ggtitle("Deaths per Sex") +
  scale_fill_manual(values=c("Orange", "#bb44f0")) + theme_classic()
#stacked.barSe

tot.p <- grid.arrange(stacked.barR,stacked.barS, stacked.barSe, ncol = 3) 

Continuous variables

Since some variables appear to be positively skewed, we apply a square root or log transformation to make their distributions closer to normal.

#transformation of Skewed variables

transform_sqrt<-c("Pulse.pressure", "Serum.Iron", "Poverty.index")
transform_log <-c("BMI", "Serum.Cholesterol","Red.blood.cells","White.blood.cells", "Sedimentation.rate")
survey_trans <- survey
survey_trans[, transform_log] <- lapply(survey[, transform_log], log)
survey_trans[, transform_sqrt] <- lapply(survey[, transform_sqrt], sqrt)
# rename the transformed columns so the transformation is explicit
# (the .log / .sqrt names are used in the later sections)
names(survey_trans)[match(transform_log, names(survey_trans))] <- paste0(transform_log, ".log")
names(survey_trans)[match(transform_sqrt, names(survey_trans))] <- paste0(transform_sqrt, ".sqrt")

orig_names  <- c(transform_log, transform_sqrt)
trans_names <- c(paste0(transform_log, ".log"), paste0(transform_sqrt, ".sqrt"))

for (i in seq_along(orig_names)){

  hist.p <- gghistogram(survey, x = orig_names[i], y = "..count..", add = "mean", bins = 20, fill = "orange", rug = TRUE, color = "orange", title = "Original") + theme_classic()

  hist.t <- gghistogram(survey_trans, x = trans_names[i], y = "..count..", add = "mean", bins = 20, fill = "orange", rug = TRUE, color = "orange", title = "Transformed") + theme_classic()
  tot.p <- grid.arrange(hist.p, hist.t, ncol = 2)

}

Let’s see how the data are distributed across the levels of the variable death.

for (var in numeric_vars){

  hist.p<-gghistogram(survey, x = var, y = "..density..", add = "median", bins = 20, rug = TRUE,fill = "death", add_density = TRUE, palette = c("Orange", "#bb44f0"))

  bx.p <- ggboxplot(survey, x="death", y=var, ylab = var, color ="death" ,palette = c("Orange", "#bb44f0"))
  tot.p <- grid.arrange(bx.p,hist.p, ncol = 2)

}

1.3 Descriptive table

A first rough overview of the descriptive statistics is given by the function summary().

summary(survey)
##       Age         Diastolic.BP    Poverty.index              Race     
##  Min.   :25.00   Min.   : 34.00   Min.   :  2.0   White        :5751  
##  1st Qu.:35.00   1st Qu.: 74.00   1st Qu.:137.0   afro-american:1158  
##  Median :49.00   Median : 82.00   Median :238.0   other        :  80  
##  Mean   :49.76   Mean   : 83.32   Mean   :289.5                       
##  3rd Qu.:66.00   3rd Qu.: 90.00   3rd Qu.:373.0                       
##  Max.   :74.00   Max.   :180.00   Max.   :999.0                       
##                                                                       
##  Red.blood.cells Sedimentation.rate Serum.Albumin   Serum.Cholesterol
##  Min.   :2.550   Min.   : 1.00      Min.   :2.700   Min.   : 53.0    
##  1st Qu.:4.400   1st Qu.: 7.00      1st Qu.:4.200   1st Qu.:188.0    
##  Median :4.710   Median :13.00      Median :4.400   Median :218.0    
##  Mean   :4.733   Mean   :15.86      Mean   :4.372   Mean   :222.5    
##  3rd Qu.:5.050   3rd Qu.:22.00      3rd Qu.:4.600   3rd Qu.:250.0    
##  Max.   :8.880   Max.   :65.00      Max.   :5.600   Max.   :793.0    
##                                                                      
##    Serum.Iron    Serum.Magnesium Serum.Protein      Sex        Systolic.BP   
##  Min.   : 17.0   Min.   :0.82    Min.   : 4.4   male  :2744   Min.   : 80.0  
##  1st Qu.: 75.0   1st Qu.:1.59    1st Qu.: 6.8   female:4245   1st Qu.:115.0  
##  Median : 96.0   Median :1.68    Median : 7.1                 Median :126.0  
##  Mean   :100.6   Mean   :1.68    Mean   : 7.1                 Mean   :130.7  
##  3rd Qu.:121.0   3rd Qu.:1.77    3rd Qu.: 7.4                 3rd Qu.:140.0  
##  Max.   :396.0   Max.   :2.70    Max.   :11.5                 Max.   :260.0  
##                                                               NA's   :1389   
##       TIBC             TS         White.blood.cells      BMI       
##  Min.   :168.0   Min.   :  3.20   Min.   : 2.30     Min.   :12.94  
##  1st Qu.:323.0   1st Qu.: 21.00   1st Qu.: 6.00     1st Qu.:22.14  
##  Median :356.0   Median : 27.10   Median : 7.20     Median :24.93  
##  Mean   :362.9   Mean   : 28.36   Mean   : 7.46     Mean   :25.68  
##  3rd Qu.:396.0   3rd Qu.: 34.30   3rd Qu.: 8.60     3rd Qu.:28.31  
##  Max.   :691.0   Max.   :100.00   Max.   :56.00     Max.   :72.22  
##                                                                    
##  Pulse.pressure     time_death       status   death   
##  Min.   : 10.00   Min.   : 0.01644   0:4488   0:5888  
##  1st Qu.: 40.00   1st Qu.:13.84429   1:2501   1:1101  
##  Median : 48.00   Median :18.84977                    
##  Mean   : 51.45   Mean   :16.28802                    
##  3rd Qu.: 60.00   3rd Qu.:20.01370                    
##  Max.   :150.00   Max.   :21.81142                    
## 

From this summary we can already see that only one variable has missing values: Systolic.BP.

Let’s present the descriptive statistics in a more polished way, comparing the patients with the event against those without it, and adding a column with the total population’s statistics.

# Descriptive statistics of patients'characteristics by outcome
survey %>% 
  dplyr::select(all_of(numeric_vars), all_of(categoric_vars)) %>%
  tbl_summary(by = death,
              type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c(
                                     "{mean} ({sd})",         
                                     "{median} ({p25}, {p75})", #median with 1th-3th percentiles
                                     "{min}, {max}",
                                     "{N_miss} ({p_miss} %)"),
              missing = "no") %>% 
  add_overall() %>% 
  add_p() %>%
  bold_labels()
Characteristic Overall, N = 6,9891 0, N = 5,888 1, N = 1,101 p-value2
Age <0.001
Mean (SD) 50 (16) 47 (15) 64 (10)
Median (IQR) 49 (35, 66) 44 (33, 62) 68 (63, 71)
Range 25, 74 25, 74 25, 74
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Diastolic.BP <0.001
Mean (SD) 83 (13) 83 (13) 87 (15)
Median (IQR) 82 (74, 90) 80 (74, 90) 86 (78, 96)
Range 34, 180 34, 180 50, 144
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Poverty.index <0.001
Mean (SD) 289 (220) 298 (218) 246 (223)
Median (IQR) 238 (137, 373) 248 (148, 377) 172 (99, 312)
Range 2, 999 2, 999 4, 999
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Red.blood.cells 0.021
Mean (SD) 4.73 (0.49) 4.73 (0.47) 4.76 (0.55)
Median (IQR) 4.71 (4.40, 5.05) 4.71 (4.40, 5.03) 4.74 (4.38, 5.12)
Range 2.55, 8.88 2.55, 8.88 2.93, 6.56
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Sedimentation.rate <0.001
Mean (SD) 16 (11) 15 (11) 20 (14)
Median (IQR) 13 (7, 22) 13 (7, 21) 17 (9, 29)
Range 1, 65 1, 60 1, 65
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Serum.Albumin <0.001
Mean (SD) 4.37 (0.33) 4.39 (0.32) 4.27 (0.34)
Median (IQR) 4.40 (4.20, 4.60) 4.40 (4.20, 4.60) 4.30 (4.10, 4.50)
Range 2.70, 5.60 2.70, 5.60 2.70, 5.60
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Serum.Cholesterol <0.001
Mean (SD) 222 (50) 221 (49) 232 (53)
Median (IQR) 218 (188, 250) 216 (187, 248) 227 (196, 262)
Range 53, 793 53, 793 87, 691
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Serum.Iron <0.001
Mean (SD) 101 (37) 101 (37) 98 (37)
Median (IQR) 96 (75, 121) 96 (76, 121) 93 (72, 116)
Range 17, 396 17, 301 23, 396
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Serum.Magnesium 0.035
Mean (SD) 1.68 (0.14) 1.68 (0.14) 1.67 (0.16)
Median (IQR) 1.68 (1.59, 1.77) 1.68 (1.59, 1.77) 1.67 (1.58, 1.77)
Range 0.82, 2.70 0.82, 2.70 0.98, 2.49
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Serum.Protein 0.077
Mean (SD) 7.10 (0.50) 7.09 (0.49) 7.13 (0.54)
Median (IQR) 7.10 (6.80, 7.40) 7.10 (6.80, 7.40) 7.10 (6.80, 7.40)
Range 4.40, 11.50 4.40, 11.50 5.30, 10.20
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Systolic.BP <0.001
Mean (SD) 131 (23) 129 (22) 146 (28)
Median (IQR) 126 (115, 140) 124 (114, 140) 142 (125, 162)
Range 80, 260 80, 260 88, 240
N missing (% missing %) 1,389 (20 %) 879 (15 %) 510 (46 %)
TIBC <0.001
Mean (SD) 363 (58) 365 (58) 350 (56)
Median (IQR) 356 (323, 396) 358 (326, 397) 345 (313, 382)
Range 168, 691 175, 691 168, 607
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
TS 0.8
Mean (SD) 28 (11) 28 (11) 29 (11)
Median (IQR) 27 (21, 34) 27 (21, 34) 27 (21, 35)
Range 3, 100 3, 100 4, 100
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
White.blood.cells 0.020
Mean (SD) 7.46 (2.29) 7.42 (2.18) 7.65 (2.80)
Median (IQR) 7.20 (6.00, 8.60) 7.20 (6.00, 8.50) 7.30 (6.00, 8.80)
Range 2.30, 56.00 2.30, 51.20 2.90, 56.00
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
BMI 0.070
Mean (SD) 25.7 (5.1) 25.6 (5.1) 25.8 (5.3)
Median (IQR) 24.9 (22.1, 28.3) 24.9 (22.1, 28.2) 25.2 (22.2, 28.7)
Range 12.9, 72.2 14.2, 63.2 12.9, 72.2
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Pulse.pressure <0.001
Mean (SD) 51 (18) 49 (17) 63 (22)
Median (IQR) 48 (40, 60) 46 (38, 58) 60 (46, 76)
Range 10, 150 10, 150 12, 150
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
time_death <0.001
Mean (SD) 16.3 (5.5) 18.3 (2.7) 5.4 (2.8)
Median (IQR) 18.8 (13.8, 20.0) 19.1 (18.1, 20.2) 5.5 (3.1, 7.9)
Range 0.0, 21.8 10.0, 21.8 0.0, 10.0
N missing (% missing %) 0 (0 %) 0 (0 %) 0 (0 %)
Race <0.001
White 5,751 (82%) 4,899 (83%) 852 (77%)
afro-american 1,158 (17%) 925 (16%) 233 (21%)
other 80 (1.1%) 64 (1.1%) 16 (1.5%)
Sex <0.001
male 2,744 (39%) 2,090 (35%) 654 (59%)
female 4,245 (61%) 3,798 (65%) 447 (41%)
status <0.001
0 4,488 (64%) 4,488 (76%) 0 (0%)
1 2,501 (36%) 1,400 (24%) 1,101 (100%)

1 c("Mean (SD)", "Median (IQR)", "Range", "N missing (% missing %)"); n (%)

2 Wilcoxon rank sum test; Pearson's Chi-squared test

As we said, only the variable Systolic.BP, the systolic blood pressure, has missing values. Its missingness rate is 20%, well above the commonly used 5% threshold, so we cannot simply ignore the missing values: we have to handle them.
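The per-variable missing percentages can also be computed directly, which is a quick way to back up the comment above. A minimal sketch in base R:

```r
# percentage of missing values per candidate predictor
miss_pct <- sort(colMeans(is.na(survey)) * 100, decreasing = TRUE)
round(miss_pct[miss_pct > 0], 1)  # only Systolic.BP should appear (about 20%)
```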

Question 2: Missing Values

You will probably note that one variable has a non-negligible rate of missing values. Do you think that the missing values are completely at random, at random or not at random?

Let’s look at some descriptive statistics comparing the patients with missing and non-missing Systolic.BP values.

#Adding a categorical variable indicating missing and non missing Systolic.BP values for each row

survey$systolic_test <- factor(ifelse(is.na(survey$Systolic.BP),"missing","non-missing"))

# Descriptive statistics of patients'characteristics by outcome
df = subset(survey, select=-Systolic.BP)
tbl_summary(df, by = systolic_test,
              type = all_continuous() ~ "continuous2",
              statistic = all_continuous() ~ c(
                                      "{mean} ({sd})",         
                                     "{median} ({p25}, {p75})", #median with percentile
                                     "{min}, {max}"),
              missing = "no") %>% 
  add_overall() %>% 
  bold_labels()
Characteristic Overall, N = 6,9891 missing, N = 1,389 non-missing, N = 5,600
Age
Mean (SD) 50 (16) 69 (4) 45 (14)
Median (IQR) 49 (35, 66) 69 (66, 72) 43 (33, 56)
Range 25, 74 39, 74 25, 74
Diastolic.BP
Mean (SD) 83 (13) 87 (13) 83 (13)
Median (IQR) 82 (74, 90) 86 (80, 94) 80 (74, 90)
Range 34, 180 50, 150 34, 180
Poverty.index
Mean (SD) 289 (220) 266 (240) 295 (214)
Median (IQR) 238 (137, 373) 190 (106, 316) 248 (147, 384)
Range 2, 999 11, 999 2, 999
Race
White 5,751 (82%) 1,147 (83%) 4,604 (82%)
afro-american 1,158 (17%) 227 (16%) 931 (17%)
other 80 (1.1%) 15 (1.1%) 65 (1.2%)
Red.blood.cells
Mean (SD) 4.73 (0.49) 4.74 (0.50) 4.73 (0.48)
Median (IQR) 4.71 (4.40, 5.05) 4.73 (4.41, 5.06) 4.71 (4.40, 5.04)
Range 2.55, 8.88 2.55, 6.49 2.58, 8.88
Sedimentation.rate
Mean (SD) 16 (11) 19 (13) 15 (11)
Median (IQR) 13 (7, 22) 16 (9, 27) 12 (7, 21)
Range 1, 65 1, 65 1, 61
Serum.Albumin
Mean (SD) 4.37 (0.33) 4.29 (0.31) 4.39 (0.33)
Median (IQR) 4.40 (4.20, 4.60) 4.30 (4.10, 4.50) 4.40 (4.20, 4.60)
Range 2.70, 5.60 2.70, 5.50 2.90, 5.60
Serum.Cholesterol
Mean (SD) 222 (50) 239 (49) 218 (49)
Median (IQR) 218 (188, 250) 236 (206, 268) 213 (185, 245)
Range 53, 793 89, 498 53, 793
Serum.Iron
Mean (SD) 101 (37) 101 (34) 101 (37)
Median (IQR) 96 (75, 121) 97 (77, 118) 95 (75, 121)
Range 17, 396 28, 274 17, 396
Serum.Magnesium
Mean (SD) 1.68 (0.14) 1.69 (0.15) 1.68 (0.14)
Median (IQR) 1.68 (1.59, 1.77) 1.70 (1.60, 1.79) 1.67 (1.59, 1.76)
Range 0.82, 2.70 1.13, 2.50 0.82, 2.70
Serum.Protein
Mean (SD) 7.10 (0.50) 7.09 (0.52) 7.10 (0.50)
Median (IQR) 7.10 (6.80, 7.40) 7.10 (6.70, 7.40) 7.10 (6.80, 7.40)
Range 4.40, 11.50 5.30, 11.50 4.40, 9.80
Sex
male 2,744 (39%) 640 (46%) 2,104 (38%)
female 4,245 (61%) 749 (54%) 3,496 (62%)
TIBC
Mean (SD) 363 (58) 345 (51) 367 (59)
Median (IQR) 356 (323, 396) 342 (311, 375) 360 (327, 399)
Range 168, 691 168, 565 175, 691
TS
Mean (SD) 28 (11) 30 (11) 28 (11)
Median (IQR) 27 (21, 34) 28 (23, 35) 27 (21, 34)
Range 3, 100 5, 96 3, 100
White.blood.cells
Mean (SD) 7.46 (2.29) 7.29 (2.72) 7.50 (2.17)
Median (IQR) 7.20 (6.00, 8.60) 7.00 (5.70, 8.30) 7.20 (6.10, 8.60)
Range 2.30, 56.00 2.30, 56.00 2.60, 51.20
BMI
Mean (SD) 25.7 (5.1) 26.0 (4.7) 25.6 (5.2)
Median (IQR) 24.9 (22.1, 28.3) 25.6 (22.8, 28.6) 24.8 (22.0, 28.2)
Range 12.9, 72.2 12.9, 47.2 14.0, 72.2
Pulse.pressure
Mean (SD) 51 (18) 65 (21) 48 (16)
Median (IQR) 48 (40, 60) 62 (50, 76) 45 (38, 56)
Range 10, 150 14, 150 10, 150
time_death
Mean (SD) 16.3 (5.5) 12.4 (6.2) 17.2 (4.8)
Median (IQR) 18.8 (13.8, 20.0) 13.0 (7.2, 18.5) 19.0 (16.4, 20.1)
Range 0.0, 21.8 0.0, 21.6 0.0, 21.8
status
0 4,488 (64%) 321 (23%) 4,167 (74%)
1 2,501 (36%) 1,068 (77%) 1,433 (26%)
death
0 5,888 (84%) 879 (63%) 5,009 (89%)
1 1,101 (16%) 510 (37%) 591 (11%)

1 c("Mean (SD)", "Median (IQR)", "Range"); n (%)

Which type of missingness do we have? A first approach is a graphical comparison between the observed and missing data in each variable, using boxplots and overlapping histograms for the numeric variables. If the values are Missing Completely at Random (MCAR) we expect the two distributions to be identical, or at least comparable: in that case the missingness is unrelated to any other data in the dataset and there is no pattern in the values that are missing. If the distributions differ, there are two possibilities:

  • MAR: the missingness is related to other variables in the dataset, but, conditional on those variables, the values that are missing are random.
  • MNAR: the missingness is related to the unobserved value itself, i.e. the value of the variable that is missing is related to the reason it is missing, possibly in addition to other variables.
#numeric variables
for (var in numeric_vars){
  hist.p<-gghistogram(survey, x = var, y ="..density.." ,add = "median", bins = 20, rug = TRUE, fill = "systolic_test",add_density = TRUE, alpha = 0.5,palette = c("Orange", "#bb44f0")) +theme_minimal()
  bx.p <- ggboxplot(survey, x="systolic_test", y=var, ylab = var, color ="systolic_test",palette = c("Orange", "#bb44f0"))+theme_minimal()
  tot.p <- grid.arrange(bx.p,hist.p, ncol = 2)

}

#categorical variable
barMR <- ggplot(data = survey, aes( x =Race, fill = systolic_test)) + 
  geom_bar(position = "fill") + 
  xlab("Race") +ylab("Relative Frequency")+
  ggtitle("Missings per Race") + 
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()
  
barMS <- ggplot(data = survey, aes( x = Sex, fill = systolic_test)) + 
  geom_bar(position = "fill") + 
  xlab("Sex") +ylab("Relative Frequency")+
  ggtitle("Missings per Sex") +  
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()

barMD <- ggplot(data = survey, aes( x = death, fill = systolic_test)) + 
  geom_bar(position = "fill") + 
  xlab("Death") +ylab("Relative Frequency")+
  ggtitle("Missings per death") +  
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()

barMSt <- ggplot(data = survey, aes( x = status, fill = systolic_test)) + 
  geom_bar(position = "fill") + 
  xlab("status") +ylab("Relative Frequency")+
  ggtitle("Missings per status") +  
  scale_fill_manual(values=c("Orange", "#bb44f0"))+theme_classic()

tot.p <- grid.arrange(barMR,barMS, barMSt, barMD, ncol = 2, nrow = 2) 

From these plots we can see that the missingness in Systolic.BP is not completely at random; on the contrary, it is related to other variables. The missing and non-missing groups differ drastically by Age, and missingness is slightly associated with higher values of Pulse.pressure and Sedimentation.rate and lower values of Poverty.index. Moreover, missingness is more frequent in patients who died within 10 years or who experienced the event before the end of the study.

We can assess whether the difference between the observed and missing distributions is statistically significant by performing a Kruskal-Wallis test on the continuous variables. In these tests:

  • \(H_0\): the two distributions are equivalent (the two samples come from identical populations)
  • \(H_a\): the difference between the distributions is statistically significant.
#kruskal-wallis
cont<- c("Age", "Pulse.pressure", "Sedimentation.rate", "Poverty.index")
categ<-c("death", "status", "Sex")

kruskal.test(Age ~ systolic_test, data = survey)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Age by systolic_test
## Kruskal-Wallis chi-squared = 2527, df = 1, p-value < 2.2e-16
kruskal.test(Pulse.pressure ~ systolic_test, data = survey)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Pulse.pressure by systolic_test
## Kruskal-Wallis chi-squared = 845.26, df = 1, p-value < 2.2e-16
kruskal.test(Sedimentation.rate ~ systolic_test, data = survey)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Sedimentation.rate by systolic_test
## Kruskal-Wallis chi-squared = 117.1, df = 1, p-value < 2.2e-16
kruskal.test(Poverty.index ~ systolic_test, data = survey)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Poverty.index by systolic_test
## Kruskal-Wallis chi-squared = 79.861, df = 1, p-value < 2.2e-16
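The categorical variables collected in the categ vector above were not tested; an analogous check for them is Pearson's chi-squared test of independence between each variable and the missingness indicator. A sketch:

```r
# chi-squared tests: is missingness in Systolic.BP independent of each categorical variable?
for (var in c("death", "status", "Sex")) {
  print(chisq.test(table(survey[[var]], survey$systolic_test)))
}
```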

Each test gives strong evidence to reject the null hypothesis, since every p-value is far below 0.05. We can therefore conclude that the observed and missing distributions differ, so the missing values are definitely not MCAR. We then have to choose between MAR and MNAR. MNAR would imply a physiological reason why the Systolic.BP measurement itself depends on Age, pressure, or sedimentation, whereas these tests only show an association. Verifying that hypothesis would require more clinical knowledge than we have: we cannot consult a specialist, and we do not know how the study was carried out or how the population was selected. I therefore choose to treat the missing values as Missing at Random. We can only guess why the missingness is related to those variables; perhaps clinicians did not want to subject severely ill elderly patients to further diagnostic tests. In any case we cannot ignore the missing values and must find a way to handle them. Furthermore, the patients without a BP measurement apparently have a higher mortality rate (the percentage of missing values is higher among patients who died within 10 years), suggesting that discarding the incomplete records would overestimate the true survival of the cohort.

Since for each patient we have two other measurements related to blood pressure, Pulse.pressure and Diastolic.BP, imputing the missing values from that information seems reasonable. But let’s first verify that these variables are actually correlated, by plotting a heatmap based on the Pearson correlation.

categoric_vars <- c('Race','Sex','status','death', 'time_death', 'systolic_test')
numeric_vars <- survey_trans %>% dplyr::select(-all_of(categoric_vars)) %>% colnames()
numeric_vars
cormat <- round(x = cor(dplyr::select(survey_trans, all_of(numeric_vars)), use = "complete.obs"), digits = 2)

# Keep only the upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

# Reorder the correlation matrix, using correlation between variables as distance
reorder_cormat <- function(cormat){
  dd <- as.dist((1 - cormat)/2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}

cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix into long format for ggplot
melted_cormat <- reshape2::melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 12, hjust = 1))+
 coord_fixed()
# Print the heatmap
ggheatmap+ geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

As we guessed, the heatmap shows that Systolic.BP is highly correlated with Diastolic.BP and Pulse.pressure. This suggests that an imputation technique such as MICE, which internally uses regression models and assumes the MAR mechanism, should give reasonably accurate results.

MICE: Data imputation

MICE (Multiple Imputation by Chained Equations). A brief overview of the algorithm, which cycles through the following steps until convergence, imputing the missing data at each cycle:

  1. Initialize the missing values with a simple imputation (e.g. the mean).
  2. Pick the variable with the fewest missing values and build a regression model for it, using the other variables as predictors.
  3. Replace its missing values with the model predictions.
  4. Move on to the next variable with missing values and repeat.
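The cycle above can be sketched for a single variable. This is a toy illustration under simplifying assumptions (one variable, one cycle, hand-picked predictors), not the full mice() machinery:

```r
# toy sketch of one chained-equation cycle for Systolic.BP
toy <- survey
miss_idx <- is.na(toy$Systolic.BP)
# step 1: initialize the missing values with the mean
toy$Systolic.BP[miss_idx] <- mean(toy$Systolic.BP, na.rm = TRUE)
# step 2: regress the variable on correlated predictors, using the complete rows only
fit <- lm(Systolic.BP ~ Diastolic.BP + Pulse.pressure + Age, data = survey[!miss_idx, ])
# step 3: replace the initialized values with the model predictions
toy$Systolic.BP[miss_idx] <- predict(fit, newdata = toy[miss_idx, ])
# (mice() also adds random noise to the predictions and iterates over all variables and cycles)
```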

#used as predictors only the variables that have correlation > 0.25 with Systolic.BP
pred_mat <- quickpred(survey_trans, mincor = 0.25)
imp <- mice(survey_trans, m=10, seed = 1, predictorMatrix = pred_mat) #m = number of datasets generated
#diagnostic plots
densityplot(imp)

bwplot(imp)

The densityplot() shows that the imputed values of Systolic.BP are shifted toward higher values than the observed ones. According to bwplot(), all the generated datasets are equivalent, so I choose the first.

survey_mice<-mice::complete(imp,1)

# See the difference using the mean

survey_mean<- survey_trans
survey_mean$Systolic.BP[is.na(survey_mean$Systolic.BP)] <- mean(survey_mean$Systolic.BP, na.rm = TRUE) 
par(mfrow=c(1,2))
boxplot(survey_trans$Systolic.BP, survey_mice$Systolic.BP, 
        names = c("observed", "imputed"), col = c("orange","#bb44f0"), main = "MICE imputation")
boxplot(survey$Systolic.BP, survey_mean$Systolic.BP, 
        names = c("observed", "imputed"), col = c("orange","#bb44f0"), main = "Mean Imputation")

The mean imputation seems to reflect the observed distribution better, but it does not take into account the relationships with the other variables. Furthermore, we saw that the missing Systolic.BP values are associated with higher Pulse.pressure and Diastolic.BP; since these variables are strongly correlated with Systolic.BP, it is plausible that the imputed values are higher than the observed ones.

Question 3: Univariable Regression

Perform univariable regression analyses, of all candidate predictors for your model.

Which regression?

  • The focus is on predicting the 10-year mortality risk, not the time to event, thus the outcome variable is death, a binary variable.
  • We do not need to worry about censoring, since the first censored patient occurs after 10.03 years.
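The second point can be checked directly, in addition to the plot below. A quick sketch:

```r
# earliest follow-up time among censored patients (status == 0): should exceed 10 years
min(survey_mice$time_death[survey_mice$status == "0"])
```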
cens<- ggplot(survey_mice, aes(x = time_death, fill = status)) +
  geom_histogram(binwidth = 0.3, position = position_stack(reverse = TRUE))+xlab("Time to death(years)")+
  scale_fill_manual(values=c("Orange", "LightGrey"))+ geom_vline(xintercept = 10, linetype="dashed")+
  ggtitle("Censored patients")+
  theme_classic()
cens

Thus I decided to use Logistic Regression.
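Before automating the analysis over all predictors, a single univariable model can be fitted by hand as a sanity check. A sketch for Age (the OR should match the corresponding row of the table):

```r
# univariable logistic regression for one predictor, reporting the OR with its 95% CI
fit_age <- glm(death ~ Age, data = survey_mice, family = binomial)
exp(cbind(OR = coef(fit_age), confint.default(fit_age)))
```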

# note: the odds ratios here express the effect of a 1-unit increase in each predictor

survey_mice %>% dplyr::select(-c(status, systolic_test)) %>% 
  tbl_uvregression(method = glm, # glm function
                 method.args = list(family = binomial),# logistic model
                 exponentiate = T, # report OR
                 y=death,# outcome variable
                 conf.level = 0.95,
                 pvalue_fun = ~style_pvalue(.x, digits = 3)) %>%
  bold_labels() %>%
  bold_p()
Characteristic N OR1 95% CI1 p-value
Age 6989 1.10 1.09, 1.11 <0.001
Diastolic.BP 6989 1.03 1.02, 1.03 <0.001
Poverty.index.sqrt 6989 0.95 0.94, 0.96 <0.001
Race 6989
White — —
afro-american 1.45 1.23, 1.70 <0.001
other 1.44 0.80, 2.43 0.198
Red.blood.cells.log 6989 1.72 0.92, 3.22 0.088
Sedimentation.rate.log 6989 1.52 1.40, 1.66 <0.001
Serum.Albumin 6989 0.31 0.26, 0.38 <0.001
Serum.Cholesterol.log 6989 2.76 2.06, 3.72 <0.001
Serum.Iron.sqrt 6989 0.95 0.92, 0.98 0.005
Serum.Magnesium 6989 0.62 0.39, 0.97 0.036
Serum.Protein 6989 1.17 1.03, 1.32 0.018
Sex 6989
male — —
female 0.38 0.33, 0.43 <0.001
Systolic.BP 6989 1.03 1.02, 1.03 <0.001
TIBC 6989 1.00 0.99, 1.00 <0.001
TS 6989 1.00 1.00, 1.01 0.385
White.blood.cells.log 6989 1.33 1.05, 1.68 0.019
BMI.log 6989 1.19 0.84, 1.67 0.319
Pulse.pressure.sqrt 6989 1.73 1.64, 1.83 <0.001
time_death 6989 0.00 0.00, Inf 0.668

1 OR = Odds Ratio, CI = Confidence Interval

Considering these univariable regressions, and assuming a linear effect for each predictor, the crude odds ratios indicate:

  • essentially no effect of Red.blood.cells, TIBC, TS and BMI on the outcome
  • a protective effect of Poverty.index, Serum.Albumin, Serum.Iron, Serum.Magnesium, and of being female rather than male
  • a greater risk with increasing Age, Diastolic.BP, Sedimentation.rate, Serum.Cholesterol, Serum.Protein, Systolic.BP, White.blood.cells and Pulse.pressure

Let’s now explore whether linearity is a valid assumption for the effect of each predictor:

# Visual exploration of the linearity of the effect for continuous predictor
par(mfrow = c(3,3))
plsmo(survey_mice$Age,survey_mice$death)
plsmo(survey_mice$Diastolic.BP,survey_mice$death) #little strange
plsmo((survey_mice$Poverty.index.sqrt),survey_mice$death) #strange
plsmo(survey_mice$Red.blood.cells.log,survey_mice$death)# U shape
plsmo((survey_mice$Sedimentation.rate.log),survey_mice$death)
plsmo(survey_mice$Serum.Albumin,survey_mice$death)
plsmo(survey_mice$Serum.Cholesterol.log,survey_mice$death)
plsmo((survey_mice$Serum.Iron.sqrt),survey_mice$death)# little strange
plsmo(survey_mice$Serum.Magnesium,survey_mice$death) #little strange

par(mfrow = c(3,3))
plsmo(survey_mice$Serum.Protein,survey_mice$death) #strange
plsmo((survey_mice$Systolic.BP),survey_mice$death)
plsmo(survey_mice$TIBC,survey_mice$death)
plsmo(survey_mice$TS,survey_mice$death)
plsmo((survey_mice$White.blood.cells.log),survey_mice$death)# strange
plsmo(survey_mice$BMI.log,survey_mice$death)# strange
plsmo(survey_mice$Pulse.pressure.sqrt,survey_mice$death)
plsmo(survey_mice$time_death,survey_mice$death)
plsmo(survey_mice$Age,survey_mice$death)

Let’s investigate whether the odd-looking plots correspond to statistically significant non-linear effects of the predictors. To check this, I used restricted cubic splines to model the non-linear components and then tested whether these components are statistically significant, looking at the Wald p-values and also using the ANOVA, where \(H_0\) assumes a linear effect between the log odds of the outcome and the predictor.

# non linearity between predictors and outcome
fit.splines.BMI <-  lrm(death ~ rcs(BMI.log,4), data=survey_mice)
print(fit.splines.BMI)
## Logistic Regression Model
##  
##  lrm(formula = death ~ rcs(BMI.log, 4), data = survey_mice)
##  
##                        Model Likelihood    Discrimination    Rank Discrim.    
##                              Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2     20.20    R2       0.005    C       0.530    
##   0           5888    d.f.            3    g        0.160    Dxy     0.059    
##   1           1101    Pr(> chi2) 0.0002    gr       1.174    gamma   0.060    
##  max |deriv| 2e-12                         gp       0.021    tau-a   0.016    
##                                            Brier    0.132                     
##  
##            Coef     S.E.   Wald Z Pr(>|Z|)
##  Intercept   6.3983 2.0264  3.16  0.0016  
##  BMI.log    -2.6999 0.6704 -4.03  <0.0001 
##  BMI.log'   10.2807 2.3560  4.36  <0.0001 
##  BMI.log'' -32.2254 7.7819 -4.14  <0.0001 
## 
summary(fit.splines.BMI)
##              Effects              Response : death 
## 
##  Factor      Low    High   Diff.   Effect  S.E.    Lower 0.95 Upper 0.95
##  BMI.log     3.0974 3.3433 0.24586 0.31535 0.08657 0.14568    0.48503   
##   Odds Ratio 3.0974 3.3433 0.24586 1.37070      NA 1.15680    1.62420
anova(fit.splines.BMI)
##                 Wald Statistics          Response: death 
## 
##  Factor     Chi-Square d.f. P    
##  BMI.log    20.51      3    1e-04
##   Nonlinear 19.80      2    1e-04
##  TOTAL      20.51      3    1e-04
# non linearity between predictors and outcome
fit.splines.TS <-  lrm(death ~ rcs(TS,4), data=survey_mice)
#print(fit.splines.TS)
#summary(fit.splines.TS)
# non linearity between predictors and outcome
fit.splines.Protein <-  lrm(death ~ rcs(Serum.Protein,3), data=survey_mice)
#print(fit.splines.Protein)
#summary(fit.splines.Protein)
#anova(fit.splines.Protein)
# non linearity between predictors and outcome
fit.splines.Mag <-  lrm(death ~ rcs(Serum.Magnesium,3), data=survey_mice)
#print(fit.splines.Mag)
#summary(fit.splines.Mag)
#anova(fit.splines.Mag)
# non linearity between predictors and outcome
fit.splines.Iron <-  lrm(death ~ rcs(Serum.Iron.sqrt,6), data=survey_mice)
#print(fit.splines.Iron)
#summary(fit.splines.Iron)
#anova(fit.splines.Iron)
# non linearity between predictors and outcome
fit.splines.Chol <-  lrm(death ~ rcs(Serum.Cholesterol.log,4), data=survey_mice)
#print(fit.splines.Chol)
#summary(fit.splines.Chol)
# non linearity between predictors and outcome
fit.splines.Red <-  lrm(death ~ rcs(Red.blood.cells.log,3), data=survey_mice)
#print(fit.splines.Red)
#summary(fit.splines.Red)
#anova(fit.splines.Red)
# non linearity between predictors and outcome
fit.splines.Poverty <-  lrm(death ~ rcs(Poverty.index.sqrt,4), data=survey_mice)
#print(fit.splines.Poverty)
#summary(fit.splines.Poverty)
#anova(fit.splines.Poverty)
# non linearity between predictors and outcome
fit.splines.Diastolic <-  lrm(death ~ rcs(Diastolic.BP,4), data=survey_mice)
#print(fit.splines.Diastolic)
#summary(fit.splines.Diastolic)
#anova(fit.splines.Diastolic)
# non linearity between predictors and outcome
fit.splines.Sys <-  lrm(death ~ rcs(Systolic.BP,4), data=survey_mice)
#print(fit.splines.Sys)
#summary(fit.splines.Sys)
#anova(fit.splines.Sys)
# non linearity between predictors and outcome
fit.splines.Pres <-  lrm(death ~ rcs(Pulse.pressure.sqrt,4), data=survey_mice)
# print(fit.splines.Pres)
# summary(fit.splines.Pres)
# anova(fit.splines.Pres)

These tests indicate a statistically significant non-linear effect for: Poverty.index, Systolic.BP, Pulse.pressure, Diastolic.BP, Red.blood.cells, Serum.Iron, BMI and Serum.Magnesium.
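The pooled non-linear p-value can also be pulled programmatically from the `anova.rms` table instead of being read off each printout. A sketch on simulated data (assumes the rms package is available; the fit below is illustrative, not one of the fits above):

```r
library(rms)

# Simulated predictor with a genuinely non-linear (quadratic) effect on the log odds
set.seed(1)
x <- rnorm(2000)
y <- rbinom(2000, 1, plogis(-1 + 0.5 * x^2))

fit <- lrm(y ~ rcs(x, 4))
a <- anova(fit)                                 # Wald statistics table
p_nonlin <- a[grep("Nonlinear", rownames(a)), "P"]  # pooled non-linear test
p_nonlin
```

With a true quadratic effect and n = 2000, this pooled test should reject linearity, which is exactly the check applied to each spline fit above.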

# Estimated effect of covariate changes on the risk of death (on the log-odds scale)

a<-ggplot(Predict(fit.splines.Poverty), colfill = "Orange")+theme_classic()
b<-ggplot(Predict(fit.splines.Sys),colfill = "Orange")+theme_classic()
c<-ggplot(Predict(fit.splines.Pres),colfill = "Orange")+theme_classic()
d<-ggplot(Predict(fit.splines.Diastolic),colfill = "Orange")+theme_classic()
e<-ggplot(Predict(fit.splines.Red),colfill = "Orange")+theme_classic()
f<-ggplot(Predict(fit.splines.Iron),colfill = "Orange")+theme_classic()
g<-ggplot(Predict(fit.splines.BMI),colfill = "Orange")+theme_classic()
h<-ggplot(Predict(fit.splines.Mag),colfill = "Orange")+theme_classic()
i<-ggplot(Predict(fit.splines.Protein), colfill = "Orange")+theme_classic()

tot.p <- grid.arrange(a,b,c,d,e,f,g,h, i, ncol = 3, nrow = 3) 

Question 4: Multivariable Regression

Build a multivariable model. Various selection procedures of the candidate predictors could be performed (try different approaches and compare them… we don’t have here the expert’s opinion, therefore the final choice will be based only on statistical evaluations). Comment on the effect estimated for the predictors retained in the final multivariable model

I fit several models using both automated algorithms (backward, stepwise, hybrid) and manual selection of predictors, guided by AIC, statistical significance and the log-likelihood ratio test. I consider here only the four best models.
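For reference, the mechanics of the automated procedures can be sketched with base R's step() on simulated data (an illustration of the selection logic only, not the models fitted below; x3 is pure noise and should tend to be dropped):

```r
# AIC-driven variable selection on simulated data
set.seed(7)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.2 * d$x1 - 0.8 * d$x2))   # x3 has no true effect

full <- glm(y ~ x1 + x2 + x3, family = binomial, data = d)
sel_bw  <- step(full, direction = "backward", trace = 0)  # backward elimination
sel_hyb <- step(full, direction = "both", trace = 0)      # hybrid / stepwise
names(coef(sel_bw))
```

The informative predictors x1 and x2 are retained with high probability, while the final AIC can never exceed that of the starting model.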

Backward elimination

Backward by hand

# Full model considering the non-linear effects
fit.multi.lrm<-lrm(death ~ Age+rcs(Diastolic.BP,4)+rcs(Poverty.index.sqrt,4)+Race+ rcs(Red.blood.cells.log,3)+
                     Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+rcs(Serum.Magnesium,3)+rcs(Serum.Protein,3)+
                     Sex+TIBC+TS+White.blood.cells.log+rcs(BMI.log,4)+rcs(Pulse.pressure.sqrt,4) ,data = survey_mice, y=T, x=T)
print(fit.multi.lrm) #adjusted odds ratio
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + rcs(Diastolic.BP, 4) + rcs(Poverty.index.sqrt, 
##      4) + Race + rcs(Red.blood.cells.log, 3) + Sedimentation.rate.log + 
##      Serum.Albumin + Serum.Cholesterol.log + rcs(Serum.Iron.sqrt, 
##      6) + rcs(Serum.Magnesium, 3) + rcs(Serum.Protein, 3) + Sex + 
##      TIBC + TS + White.blood.cells.log + rcs(BMI.log, 4) + rcs(Pulse.pressure.sqrt, 
##      4), data = survey_mice, x = T, y = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1716.34    R2       0.374    C       0.858    
##   0           5888    d.f.            33    g        1.986    Dxy     0.716    
##   1           1101    Pr(> chi2) <0.0001    gr       7.287    gamma   0.716    
##  max |deriv| 6e-08                          gp       0.190    tau-a   0.190    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                9.9872  3.2930   3.03 0.0024  
##  Age                      0.0894  0.0042  21.48 <0.0001 
##  Diastolic.BP            -0.0208  0.0130  -1.60 0.1096  
##  Diastolic.BP'            0.0651  0.0434   1.50 0.1341  
##  Diastolic.BP''          -0.1457  0.1360  -1.07 0.2840  
##  Poverty.index.sqrt      -0.0208  0.0291  -0.71 0.4749  
##  Poverty.index.sqrt'     -0.1338  0.1376  -0.97 0.3307  
##  Poverty.index.sqrt''     0.4932  0.4070   1.21 0.2255  
##  Race=afro-american       0.0011  0.1157   0.01 0.9924  
##  Race=other               0.4355  0.3476   1.25 0.2102  
##  Red.blood.cells.log     -2.6123  0.7551  -3.46 0.0005  
##  Red.blood.cells.log'     3.5823  0.8772   4.08 <0.0001 
##  Sedimentation.rate.log   0.2099  0.0559   3.75 0.0002  
##  Serum.Albumin           -0.5375  0.1477  -3.64 0.0003  
##  Serum.Cholesterol.log   -0.1042  0.1919  -0.54 0.5871  
##  Serum.Iron.sqrt          0.1311  0.1379   0.95 0.3416  
##  Serum.Iron.sqrt'        -3.4756  1.0553  -3.29 0.0010  
##  Serum.Iron.sqrt''       23.5598  7.0367   3.35 0.0008  
##  Serum.Iron.sqrt'''     -44.0801 14.5286  -3.03 0.0024  
##  Serum.Iron.sqrt''''     33.5164 13.3426   2.51 0.0120  
##  Serum.Magnesium         -1.5536  0.5097  -3.05 0.0023  
##  Serum.Magnesium'         0.9292  0.5702   1.63 0.1031  
##  Serum.Protein           -0.0301  0.1807  -0.17 0.8677  
##  Serum.Protein'           0.2812  0.1823   1.54 0.1229  
##  Sex=female              -1.0573  0.0898 -11.77 <0.0001 
##  TIBC                     0.0028  0.0016   1.74 0.0814  
##  TS                       0.0172  0.0187   0.92 0.3572  
##  White.blood.cells.log    0.7833  0.1502   5.22 <0.0001 
##  BMI.log                 -2.7273  0.8112  -3.36 0.0008  
##  BMI.log'                 4.6291  2.7523   1.68 0.0926  
##  BMI.log''               -9.9192  9.0290  -1.10 0.2719  
##  Pulse.pressure.sqrt     -0.1290  0.1687  -0.76 0.4446  
##  Pulse.pressure.sqrt'     0.3395  0.5979   0.57 0.5701  
##  Pulse.pressure.sqrt''   -0.4635  1.7710  -0.26 0.7935  
## 
s <- summary(fit.multi.lrm)

fit.multi.glm<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,4)+Race+ ns(Red.blood.cells.log,3)+
                     Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+ns(Serum.Protein,3)+
                     Sex+ns(Systolic.BP,4)+TIBC+TS+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)

We should also consider the correlation between variables. According to the heatmap, some variables are highly correlated, such as Systolic.BP, Diastolic.BP and Pulse.pressure. In this case collinearity is an issue to take into account. We can test it using the VIF (Variance Inflation Factor). The smallest possible value of the VIF is 1, which indicates a complete absence of multicollinearity. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
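Before applying it to the fitted model, a toy illustration of what the VIF detects (simulated data; `vif()` here is car::vif, an assumption about the loaded package): pulse pressure is almost a linear combination of systolic and diastolic BP, so all three predictors get inflated VIFs.

```r
library(car)

# Simulated collinear blood-pressure predictors
set.seed(9)
n <- 500
sys <- rnorm(n, 130, 15)
dia <- 0.6 * sys + rnorm(n, 0, 5)
pp  <- sys - dia + rnorm(n, 0, 1)   # near-exact linear combination
y   <- rbinom(n, 1, plogis(0.02 * (sys - 130)))

m <- glm(y ~ sys + dia + pp, family = binomial)
vif(m)   # at least one VIF far above the 5-10 rule-of-thumb threshold
```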

vif_values <- vif(fit.multi.lrm)
vif_values
##                    Age           Diastolic.BP          Diastolic.BP' 
##               1.428483              22.837245             260.245029 
##         Diastolic.BP''     Poverty.index.sqrt    Poverty.index.sqrt' 
##             153.584416              22.309495             301.220465 
##   Poverty.index.sqrt''     Race=afro-american             Race=other 
##             188.119937               1.393221               1.022300 
##    Red.blood.cells.log   Red.blood.cells.log' Sedimentation.rate.log 
##               4.821616               4.520868               1.524514 
##          Serum.Albumin  Serum.Cholesterol.log        Serum.Iron.sqrt 
##               1.546356               1.154070              39.247834 
##       Serum.Iron.sqrt'      Serum.Iron.sqrt''     Serum.Iron.sqrt''' 
##            2548.466712           17059.507125           17024.966107 
##    Serum.Iron.sqrt''''        Serum.Magnesium       Serum.Magnesium' 
##            2055.354192               4.227201               4.067548 
##          Serum.Protein         Serum.Protein'             Sex=female 
##               6.032579               5.254931               1.369111 
##                   TIBC                     TS  White.blood.cells.log 
##               5.199133              29.506607               1.125166 
##                BMI.log               BMI.log'              BMI.log'' 
##              16.102262             165.680213              98.447754 
##    Pulse.pressure.sqrt   Pulse.pressure.sqrt'  Pulse.pressure.sqrt'' 
##              32.935570             428.295194             254.721086
fit.multi.glm %>%
  tbl_regression(exponentiate = TRUE) %>%
  bold_labels() %>%
  bold_p() %>%
  add_vif() #adjusted GVIF should be < 2
Characteristic OR1 95% CI1 p-value GVIF1 Adjusted GVIF2,1
Age 1.09 1.08, 1.10 <0.001 1.5 1.2
ns(Diastolic.BP, 4) 208 1.9
ns(Diastolic.BP, 4)1 0.04 0.00, 1.12 0.057
ns(Diastolic.BP, 4)2 0.03 0.00, 1.76 0.092
ns(Diastolic.BP, 4)3 0.00 0.00, 9.75 0.15
ns(Diastolic.BP, 4)4 0.03 0.00, 31.7 0.3
ns(Poverty.index.sqrt, 4) 1.2 1.0
ns(Poverty.index.sqrt, 4)1 0.86 0.40, 1.85 0.7
ns(Poverty.index.sqrt, 4)2 0.62 0.34, 1.14 0.12
ns(Poverty.index.sqrt, 4)3 1.09 0.18, 6.96 >0.9
ns(Poverty.index.sqrt, 4)4 0.62 0.40, 0.96 0.034
Race 1.4 1.1
White — —
afro-american 1.00 0.80, 1.26 >0.9
other 1.59 0.79, 3.09 0.2
ns(Red.blood.cells.log, 3) 1.7 1.1
ns(Red.blood.cells.log, 3)1 0.29 0.12, 0.67 0.004
ns(Red.blood.cells.log, 3)2 0.19 0.00, 9.36 0.4
ns(Red.blood.cells.log, 3)3 11.3 0.88, 165 0.065
Sedimentation.rate.log 1.22 1.09, 1.37 <0.001 1.5 1.2
Serum.Albumin 0.57 0.43, 0.76 <0.001 1.6 1.3
Serum.Cholesterol.log 0.92 0.63, 1.34 0.6 1.2 1.1
ns(Serum.Iron.sqrt, 6) 31 1.3
ns(Serum.Iron.sqrt, 6)1 1.50 0.39, 6.36 0.6
ns(Serum.Iron.sqrt, 6)2 1.09 0.23, 5.72 >0.9
ns(Serum.Iron.sqrt, 6)3 1.59 0.33, 8.39 0.6
ns(Serum.Iron.sqrt, 6)4 0.25 0.04, 1.66 0.15
ns(Serum.Iron.sqrt, 6)5 4.84 0.08, 344 0.5
ns(Serum.Iron.sqrt, 6)6 2.45 0.03, 197 0.7
ns(Serum.Magnesium, 3) 1.1 1.0
ns(Serum.Magnesium, 3)1 0.39 0.17, 0.93 0.029
ns(Serum.Magnesium, 3)2 1.60 0.05, 63.5 0.8
ns(Serum.Magnesium, 3)3 2.25 0.36, 13.1 0.4
ns(Serum.Protein, 3) 1.7 1.1
ns(Serum.Protein, 3)1 1.23 0.47, 3.25 0.7
ns(Serum.Protein, 3)2 0.14 0.00, 11.1 0.4
ns(Serum.Protein, 3)3 0.86 0.04, 29.1 >0.9
Sex 1.4 1.2
male — —
female 0.34 0.29, 0.41 <0.001
ns(Systolic.BP, 4) 1,432 2.5
ns(Systolic.BP, 4)1 14.0 0.60, 334 0.10
ns(Systolic.BP, 4)2 120 0.88, 16,835 0.057
ns(Systolic.BP, 4)3 1,347 0.14, 13,912,837 0.12
ns(Systolic.BP, 4)4 358 0.09, 1,565,476 0.2
TIBC 1.00 1.00, 1.01 0.084 5.4 2.3
TS 1.02 0.98, 1.05 0.4 30 5.5
White.blood.cells.log 2.18 1.62, 2.93 <0.001 1.1 1.1
ns(BMI.log, 4) 1.4 1.0
ns(BMI.log, 4)1 0.16 0.04, 0.56 0.005
ns(BMI.log, 4)2 0.37 0.15, 0.91 0.032
ns(BMI.log, 4)3 0.05 0.00, 0.87 0.042
ns(BMI.log, 4)4 0.31 0.04, 2.00 0.2
ns(Pulse.pressure.sqrt, 3) 507 2.8
ns(Pulse.pressure.sqrt, 3)1 0.03 0.00, 0.87 0.041
ns(Pulse.pressure.sqrt, 3)2 0.00 0.00, 3.09 0.10
ns(Pulse.pressure.sqrt, 3)3 0.02 0.00, 8.14 0.2

1 OR = Odds Ratio, CI = Confidence Interval, GVIF = Generalized Variance Inflation Factor

2 GVIF^[1/(2*df)]

As expected, there is a collinearity problem among the three blood-pressure measurements (Systolic.BP, Diastolic.BP, Pulse.pressure) and among the iron-related variables (TS, TIBC, Serum.Iron). Let's drop TS, which has the greatest adjusted GVIF, and then compare the models with an ANOVA (likelihood-ratio test), where \(H_0\) states that the simpler model is adequate.

#removed predictor with highest VIF: TS
fit.multi.glm_ts<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,3)+Race+ ns(Red.blood.cells.log,4)+
                     Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+Serum.Protein+
                     Sex+ns(Systolic.BP,4)+TIBC+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)

fit.multi.glm_ts
## 
## Call:  glm(formula = death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 
##     3) + Race + ns(Red.blood.cells.log, 4) + Sedimentation.rate.log + 
##     Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt, 
##     6) + ns(Serum.Magnesium, 3) + Serum.Protein + Sex + ns(Systolic.BP, 
##     4) + TIBC + White.blood.cells.log + ns(BMI.log, 4) + ns(Pulse.pressure.sqrt, 
##     3), family = binomial, data = survey_mice)
## 
## Coefficients:
##                 (Intercept)                          Age  
##                   -0.981747                     0.089018  
##        ns(Diastolic.BP, 4)1         ns(Diastolic.BP, 4)2  
##                   -3.356232                    -3.618972  
##        ns(Diastolic.BP, 4)3         ns(Diastolic.BP, 4)4  
##                   -6.597642                    -3.558118  
##  ns(Poverty.index.sqrt, 3)1   ns(Poverty.index.sqrt, 3)2  
##                   -0.708969                    -0.529594  
##  ns(Poverty.index.sqrt, 3)3            Raceafro-american  
##                   -0.499937                     0.008943  
##                   Raceother  ns(Red.blood.cells.log, 4)1  
##                    0.461441                    -2.044900  
## ns(Red.blood.cells.log, 4)2  ns(Red.blood.cells.log, 4)3  
##                   -1.194556                    -1.825077  
## ns(Red.blood.cells.log, 4)4       Sedimentation.rate.log  
##                    2.275725                     0.202774  
##               Serum.Albumin        Serum.Cholesterol.log  
##                   -0.574562                    -0.096457  
##     ns(Serum.Iron.sqrt, 6)1      ns(Serum.Iron.sqrt, 6)2  
##                    0.612242                     0.365352  
##     ns(Serum.Iron.sqrt, 6)3      ns(Serum.Iron.sqrt, 6)4  
##                    0.816055                    -0.651667  
##     ns(Serum.Iron.sqrt, 6)5      ns(Serum.Iron.sqrt, 6)6  
##                    2.831390                     2.525582  
##     ns(Serum.Magnesium, 3)1      ns(Serum.Magnesium, 3)2  
##                   -0.947189                     0.441262  
##     ns(Serum.Magnesium, 3)3                Serum.Protein  
##                    0.803462                     0.218432  
##                   Sexfemale          ns(Systolic.BP, 4)1  
##                   -1.060542                     2.647317  
##         ns(Systolic.BP, 4)2          ns(Systolic.BP, 4)3  
##                    4.818476                     7.258152  
##         ns(Systolic.BP, 4)4                         TIBC  
##                    5.949845                     0.001558  
##       White.blood.cells.log              ns(BMI.log, 4)1  
##                    0.777769                    -1.816743  
##             ns(BMI.log, 4)2              ns(BMI.log, 4)3  
##                   -0.985700                    -2.922636  
##             ns(BMI.log, 4)4  ns(Pulse.pressure.sqrt, 3)1  
##                   -1.131883                    -3.388004  
## ns(Pulse.pressure.sqrt, 3)2  ns(Pulse.pressure.sqrt, 3)3  
##                   -5.741128                    -4.075939  
## 
## Degrees of Freedom: 6988 Total (i.e. Null);  6947 Residual
## Null Deviance:       6088 
## Residual Deviance: 4362  AIC: 4446
anova( fit.multi.glm_ts, fit.multi.glm, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 3) + 
##     Race + ns(Red.blood.cells.log, 4) + Sedimentation.rate.log + 
##     Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt, 
##     6) + ns(Serum.Magnesium, 3) + Serum.Protein + Sex + ns(Systolic.BP, 
##     4) + TIBC + White.blood.cells.log + ns(BMI.log, 4) + ns(Pulse.pressure.sqrt, 
##     3)
## Model 2: death ~ Age + ns(Diastolic.BP, 4) + ns(Poverty.index.sqrt, 4) + 
##     Race + ns(Red.blood.cells.log, 3) + Sedimentation.rate.log + 
##     Serum.Albumin + Serum.Cholesterol.log + ns(Serum.Iron.sqrt, 
##     6) + ns(Serum.Magnesium, 3) + ns(Serum.Protein, 3) + Sex + 
##     ns(Systolic.BP, 4) + TIBC + TS + White.blood.cells.log + 
##     ns(BMI.log, 4) + ns(Pulse.pressure.sqrt, 3)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      6947     4362.1                     
## 2      6944     4357.3  3   4.8264    0.185
# Likelihood-ratio test: H_0 = the simpler model fits as well, so keep it; H_a = prefer the more complex model.
# If the resulting p-value is sufficiently low (usually less than 0.05), we conclude that the more complex
# model is significantly better than the simpler one.

According to these tests (non-significant LRT, smaller AIC) we should keep the simpler model.
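The logic of this nested-model comparison can be sketched on simulated data (illustrative names; the extra predictor x2 has no true effect, so the LRT typically favours the smaller model):

```r
# Likelihood-ratio test between nested logistic models
set.seed(42)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x1))   # x2 has no true effect

m_small <- glm(y ~ x1, family = binomial)
m_big   <- glm(y ~ x1 + x2, family = binomial)
lrt <- anova(m_small, m_big, test = "LRT")
lrt   # a non-significant p-value supports keeping the simpler model
```

Note that the larger model always has residual deviance no greater than the smaller one; the LRT asks whether the drop is more than chance.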

Considering the ANOVA results, the reduction in AIC, the VIFs and the statistical significance of the coefficients, I ended up with this model, which is the full model without Systolic.BP, TS and Race.

fit.multi.glm_res<-glm(death ~ Age+ns(Diastolic.BP,4)+ns(Poverty.index.sqrt,3)+ns(Red.blood.cells.log,4)+
                     Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ ns(Serum.Iron.sqrt,6)+ns(Serum.Magnesium,3)+ns(Serum.Protein,3)+
                     Sex+TIBC+White.blood.cells.log+ns(BMI.log,4)+ns(Pulse.pressure.sqrt,3) ,family = binomial, data = survey_mice)

fit.multi.lrm_res<-lrm(death ~ Age+Poverty.index.sqrt+rcs(Red.blood.cells.log,4)+
                     Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+I(Serum.Magnesium^2)+Serum.Protein+
                     Sex+TIBC+White.blood.cells.log+rcs(BMI.log,3)+Pulse.pressure.sqrt^2 ,data = survey_mice, y=T, x=T)
summary(fit.multi.lrm_res)
##              Effects              Response : death 
## 
##  Factor                 Low      High     Diff.    Effect    S.E.    
##  Age                     35.0000  66.0000 31.00000  2.737500 0.124690
##   Odds Ratio             35.0000  66.0000 31.00000 15.448000       NA
##  Poverty.index.sqrt      11.7050  19.3130  7.60850 -0.235200 0.049413
##   Odds Ratio             11.7050  19.3130  7.60850  0.790410       NA
##  Red.blood.cells.log      1.4816   1.6194  0.13778 -0.055147 0.109740
##   Odds Ratio              1.4816   1.6194  0.13778  0.946350       NA
##  Sedimentation.rate.log   1.9459   3.0910  1.14510  0.253900 0.063756
##   Odds Ratio              1.9459   3.0910  1.14510  1.289000       NA
##  Serum.Albumin            4.2000   4.6000  0.40000 -0.245050 0.057647
##   Odds Ratio              4.2000   4.6000  0.40000  0.782670       NA
##  Serum.Cholesterol.log    5.2364   5.5215  0.28502 -0.037934 0.054452
##   Odds Ratio              5.2364   5.5215  0.28502  0.962780       NA
##  Serum.Iron.sqrt          8.6603  11.0000  2.33970 -0.219010 0.102470
##   Odds Ratio              8.6603  11.0000  2.33970  0.803310       NA
##  Serum.Magnesium          1.5900   1.7700  0.18000 -0.144290 0.046492
##   Odds Ratio              1.5900   1.7700  0.18000  0.865640       NA
##  Serum.Protein            6.8000   7.4000  0.60000  0.157000 0.052595
##   Odds Ratio              6.8000   7.4000  0.60000  1.170000       NA
##  TIBC                   323.0000 396.0000 73.00000  0.117080 0.055530
##   Odds Ratio            323.0000 396.0000 73.00000  1.124200       NA
##  White.blood.cells.log    1.7918   2.1518  0.36000  0.274240 0.052523
##   Odds Ratio              1.7918   2.1518  0.36000  1.315500       NA
##  BMI.log                  3.0974   3.3433  0.24586 -0.190410 0.054730
##   Odds Ratio              3.0974   3.3433  0.24586  0.826620       NA
##  Pulse.pressure.sqrt      6.3246   7.7460  1.42140  0.155690 0.047212
##   Odds Ratio              6.3246   7.7460  1.42140  1.168500       NA
##  Sex - male:female        2.0000   1.0000       NA  1.074500 0.089690
##   Odds Ratio              2.0000   1.0000       NA  2.928700       NA
##  Lower 0.95 Upper 0.95
##   2.4931000  2.981900 
##  12.0990000 19.725000 
##  -0.3320500 -0.138360 
##   0.7174500  0.870790 
##  -0.2702200  0.159930 
##   0.7632100  1.173400 
##   0.1289400  0.378860 
##   1.1376000  1.460600 
##  -0.3580300 -0.132060 
##   0.6990500  0.876290 
##  -0.1446600  0.068790 
##   0.8653200  1.071200 
##  -0.4198400 -0.018178 
##   0.6571500  0.981990 
##  -0.2354100 -0.053165 
##   0.7902500  0.948220 
##   0.0539180  0.260090 
##   1.0554000  1.297000 
##   0.0082415  0.225920 
##   1.0083000  1.253500 
##   0.1713000  0.377190 
##   1.1868000  1.458200 
##  -0.2976800 -0.083139 
##   0.7425400  0.920220 
##   0.0631560  0.248220 
##   1.0652000  1.281700 
##   0.8987600  1.250300 
##   2.4566000  3.491500
fit.multi.lrm_res
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 
##      4) + Sedimentation.rate.log + Serum.Albumin + Serum.Cholesterol.log + 
##      rcs(Serum.Iron.sqrt, 6) + I(Serum.Magnesium^2) + Serum.Protein + 
##      Sex + TIBC + White.blood.cells.log + rcs(BMI.log, 3) + Pulse.pressure.sqrt^2, 
##      data = survey_mice, x = T, y = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1680.79    R2       0.368    C       0.855    
##   0           5888    d.f.            21    g        1.953    Dxy     0.709    
##   1           1101    Pr(> chi2) <0.0001    gr       7.047    gamma   0.710    
##  max |deriv| 3e-08                          gp       0.188    tau-a   0.188    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                2.6977  2.3308   1.16 0.2471  
##  Age                      0.0883  0.0040  21.95 <0.0001 
##  Poverty.index.sqrt      -0.0309  0.0065  -4.76 <0.0001 
##  Red.blood.cells.log     -2.5337  1.0658  -2.38 0.0174  
##  Red.blood.cells.log'     2.0533  3.1998   0.64 0.5211  
##  Red.blood.cells.log''    6.3288 14.1235   0.45 0.6541  
##  Sedimentation.rate.log   0.2217  0.0557   3.98 <0.0001 
##  Serum.Albumin           -0.6126  0.1441  -4.25 <0.0001 
##  Serum.Cholesterol.log   -0.1331  0.1910  -0.70 0.4860  
##  Serum.Iron.sqrt          0.2151  0.1264   1.70 0.0888  
##  Serum.Iron.sqrt'        -3.5726  1.0361  -3.45 0.0006  
##  Serum.Iron.sqrt''       24.3320  6.9831   3.48 0.0005  
##  Serum.Iron.sqrt'''     -45.7553 14.4523  -3.17 0.0015  
##  Serum.Iron.sqrt''''     35.4162 13.2701   2.67 0.0076  
##  Serum.Magnesium         -0.2386  0.0769  -3.10 0.0019  
##  Serum.Protein            0.2617  0.0877   2.99 0.0028  
##  Sex=female              -1.0745  0.0897 -11.98 <0.0001 
##  TIBC                     0.0016  0.0008   2.11 0.0350  
##  White.blood.cells.log    0.7618  0.1459   5.22 <0.0001 
##  BMI.log                 -2.3257  0.4892  -4.75 <0.0001 
##  BMI.log'                 2.3391  0.5656   4.14 <0.0001 
##  Pulse.pressure.sqrt      0.1095  0.0332   3.30 0.0010  
## 
anova(fit.multi.lrm_res)
##                 Wald Statistics          Response: death 
## 
##  Factor                 Chi-Square d.f. P     
##  Age                     482.01     1   <.0001
##  Poverty.index.sqrt       22.66     1   <.0001
##  Red.blood.cells.log      21.16     3   0.0001
##   Nonlinear               21.11     2   <.0001
##  Sedimentation.rate.log   15.86     1   0.0001
##  Serum.Albumin            18.07     1   <.0001
##  Serum.Cholesterol.log     0.49     1   0.4860
##  Serum.Iron.sqrt          21.63     5   0.0006
##   Nonlinear               18.43     4   0.0010
##  Serum.Magnesium           9.63     1   0.0019
##  Serum.Protein             8.91     1   0.0028
##  Sex                     143.54     1   <.0001
##  TIBC                      4.45     1   0.0350
##  White.blood.cells.log    27.26     1   <.0001
##  BMI.log                  22.71     2   <.0001
##   Nonlinear               17.10     1   <.0001
##  Pulse.pressure.sqrt      10.87     1   0.0010
##  TOTAL NONLINEAR          58.00     7   <.0001
##  TOTAL                  1014.11    21   <.0001

Backward algorithm: fastbw()

#### BACKWARD ELIMINATION ALGORITHM
fastbw(fit.multi.lrm)
## 
##  Deleted               Chi-Sq d.f. P      Residual d.f. P      AIC  
##  Race                  1.58   2    0.4549 1.58     2    0.4549 -2.42
##  Serum.Cholesterol.log 0.32   1    0.5699 1.90     3    0.5938 -4.10
##  TS                    0.86   1    0.3533 2.76     4    0.5988 -5.24
##  TIBC                  3.68   1    0.0550 6.44     5    0.2655 -3.56
## 
## Approximate Estimates after Deleting Factors
## 
##                             Coef      S.E.    Wald Z         P
## Intercept               10.43670  3.199911   3.26156 1.108e-03
## Age                      0.08741  0.004001  21.84560 0.000e+00
## Diastolic.BP            -0.02094  0.013008  -1.60996 1.074e-01
## Diastolic.BP'            0.06592  0.043402   1.51892 1.288e-01
## Diastolic.BP''          -0.14817  0.135936  -1.09002 2.757e-01
## Poverty.index.sqrt      -0.01854  0.028792  -0.64400 5.196e-01
## Poverty.index.sqrt'     -0.14524  0.137256  -1.05817 2.900e-01
## Poverty.index.sqrt''     0.52641  0.406531   1.29489 1.954e-01
## Red.blood.cells.log     -2.69156  0.749768  -3.58986 3.309e-04
## Red.blood.cells.log'     3.70029  0.873306   4.23711 2.264e-05
## Sedimentation.rate.log   0.20586  0.055632   3.70039 2.153e-04
## Serum.Albumin           -0.51157  0.143972  -3.55327 3.805e-04
## Serum.Iron.sqrt          0.14783  0.125163   1.18106 2.376e-01
## Serum.Iron.sqrt'        -3.15297  1.035712  -3.04426 2.333e-03
## Serum.Iron.sqrt''       22.46413  7.003681   3.20747 1.339e-03
## Serum.Iron.sqrt'''     -43.08006 14.515197  -2.96793 2.998e-03
## Serum.Iron.sqrt''''     33.41291 13.336965   2.50529 1.224e-02
## Serum.Magnesium         -1.56088  0.504763  -3.09231 1.986e-03
## Serum.Magnesium'         0.92668  0.567957   1.63160 1.028e-01
## Serum.Protein           -0.01414  0.179638  -0.07871 9.373e-01
## Serum.Protein'           0.27909  0.181994   1.53353 1.251e-01
## Sex=female              -1.04347  0.087320 -11.95003 0.000e+00
## White.blood.cells.log    0.79061  0.146870   5.38304 7.324e-08
## BMI.log                 -2.66486  0.807979  -3.29819 9.731e-04
## BMI.log'                 4.43223  2.747111   1.61341 1.067e-01
## BMI.log''               -9.27341  9.013034  -1.02889 3.035e-01
## Pulse.pressure.sqrt     -0.13758  0.168511  -0.81645 4.142e-01
## Pulse.pressure.sqrt'     0.37838  0.597317   0.63346 5.264e-01
## Pulse.pressure.sqrt''   -0.56603  1.769770  -0.31983 7.491e-01
## 
## Factors in Final Model
## 
##  [1] Age                    Diastolic.BP           Poverty.index.sqrt    
##  [4] Red.blood.cells.log    Sedimentation.rate.log Serum.Albumin         
##  [7] Serum.Iron.sqrt        Serum.Magnesium        Serum.Protein         
## [10] Sex                    White.blood.cells.log  BMI.log               
## [13] Pulse.pressure.sqrt
# After several tests reducing the non-linearity of Poverty and Iron we obtain a better model
fit.bw1<- lrm(death~Age+ Poverty.index.sqrt+rcs(Red.blood.cells.log,3)+
                Sedimentation.rate.log+Serum.Albumin+rcs(Serum.Magnesium,3)+Systolic.BP+Serum.Protein+Sex+rcs(BMI.log,3)+
                rcs(Serum.Iron.sqrt,6)+White.blood.cells.log, data = survey_mice, y = TRUE, x=TRUE)
fit.bw1
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 
##      3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium, 
##      3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) + 
##      rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice, 
##      x = TRUE, y = TRUE)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1690.00    R2       0.369    C       0.856    
##   0           5888    d.f.            19    g        1.980    Dxy     0.712    
##   1           1101    Pr(> chi2) <0.0001    gr       7.240    gamma   0.712    
##  max |deriv| 1e-08                          gp       0.189    tau-a   0.189    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                5.4151  2.1170   2.56 0.0105  
##  Age                      0.0862  0.0038  22.58 <0.0001 
##  Poverty.index.sqrt      -0.0295  0.0065  -4.55 <0.0001 
##  Red.blood.cells.log     -2.8444  0.7427  -3.83 0.0001  
##  Red.blood.cells.log'     3.8544  0.8660   4.45 <0.0001 
##  Sedimentation.rate.log   0.2081  0.0553   3.76 0.0002  
##  Serum.Albumin           -0.5789  0.1418  -4.08 <0.0001 
##  Serum.Magnesium         -1.6249  0.5016  -3.24 0.0012  
##  Serum.Magnesium'         1.0067  0.5631   1.79 0.0738  
##  Systolic.BP              0.0077  0.0016   4.70 <0.0001 
##  Serum.Protein            0.2456  0.0878   2.80 0.0052  
##  Sex=female              -1.0500  0.0864 -12.15 <0.0001 
##  BMI.log                 -2.4128  0.4862  -4.96 <0.0001 
##  BMI.log'                 2.2768  0.5638   4.04 <0.0001 
##  Serum.Iron.sqrt          0.1596  0.1261   1.27 0.2054  
##  Serum.Iron.sqrt'        -3.2556  1.0373  -3.14 0.0017  
##  Serum.Iron.sqrt''       23.0333  6.9910   3.29 0.0010  
##  Serum.Iron.sqrt'''     -44.1589 14.4662  -3.05 0.0023  
##  Serum.Iron.sqrt''''     34.4816 13.2855   2.60 0.0094  
##  White.blood.cells.log    0.7798  0.1460   5.34 <0.0001 
## 
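As a quick check that the reduction does not significantly worsen the fit, the reduced model can be compared with the full model through a likelihood-ratio test for nested fits (a sketch, assuming fit.bw1 is nested in fit.multi.lrm; lrtest() is provided by rms):

```r
# LR test: full model (fit.multi.lrm) vs reduced model (fit.bw1)
lrtest(fit.multi.lrm, fit.bw1)
```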

Univariable filtering

#Consider the predictors statistically significant with p < 0.05 in the univariate models
predictor <- c("Age", "Diastolic.BP", "Poverty.index.sqrt",
               "Sedimentation.rate.log", "Serum.Albumin", "Serum.Cholesterol.log", "Serum.Iron.sqrt", "Serum.Magnesium",
               "Sex", "Systolic.BP", "White.blood.cells.log", "Pulse.pressure.sqrt")

fit.univ<-lrm(death ~ rcs(Diastolic.BP,4)+rcs(Systolic.BP,4)+Age+rcs(Poverty.index.sqrt,3)+
                Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+ rcs(Serum.Magnesium,3)+
                Sex+White.blood.cells.log+rcs(Pulse.pressure.sqrt,3) ,data = survey_mice, y=T, x=T)
fit.univ
## Logistic Regression Model
##  
##  lrm(formula = death ~ rcs(Diastolic.BP, 4) + rcs(Systolic.BP, 
##      4) + Age + rcs(Poverty.index.sqrt, 3) + Sedimentation.rate.log + 
##      Serum.Albumin + Serum.Cholesterol.log + rcs(Serum.Iron.sqrt, 
##      6) + rcs(Serum.Magnesium, 3) + Sex + White.blood.cells.log + 
##      rcs(Pulse.pressure.sqrt, 3), data = survey_mice, x = T, y = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1661.59    R2       0.364    C       0.854    
##   0           5888    d.f.            23    g        1.964    Dxy     0.708    
##   1           1101    Pr(> chi2) <0.0001    gr       7.125    gamma   0.708    
##  max |deriv| 1e-08                          gp       0.188    tau-a   0.188    
##                                             Brier    0.099                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                3.0360  2.2524   1.35 0.1777  
##  Diastolic.BP            -0.0606  0.0292  -2.08 0.0378  
##  Diastolic.BP'            0.0639  0.0455   1.40 0.1609  
##  Diastolic.BP''          -0.1075  0.1417  -0.76 0.4479  
##  Systolic.BP              0.0292  0.0289   1.01 0.3111  
##  Systolic.BP'             0.0440  0.0478   0.92 0.3567  
##  Systolic.BP''           -0.1277  0.1089  -1.17 0.2410  
##  Age                      0.0870  0.0040  21.68 <0.0001 
##  Poverty.index.sqrt      -0.0611  0.0157  -3.90 <0.0001 
##  Poverty.index.sqrt'      0.0400  0.0191   2.10 0.0360  
##  Sedimentation.rate.log   0.2188  0.0505   4.33 <0.0001 
##  Serum.Albumin           -0.4193  0.1267  -3.31 0.0009  
##  Serum.Cholesterol.log   -0.2179  0.1886  -1.16 0.2481  
##  Serum.Iron.sqrt          0.0650  0.1233   0.53 0.5978  
##  Serum.Iron.sqrt'        -2.7593  1.0253  -2.69 0.0071  
##  Serum.Iron.sqrt''       21.0484  6.9458   3.03 0.0024  
##  Serum.Iron.sqrt'''     -42.0525 14.4064  -2.92 0.0035  
##  Serum.Iron.sqrt''''     33.9359 13.2416   2.56 0.0104  
##  Serum.Magnesium         -1.7224  0.5002  -3.44 0.0006  
##  Serum.Magnesium'         1.0793  0.5618   1.92 0.0547  
##  Sex=female              -1.0430  0.0830 -12.56 <0.0001 
##  White.blood.cells.log    0.7274  0.1439   5.05 <0.0001 
##  Pulse.pressure.sqrt     -0.6130  0.3326  -1.84 0.0653  
##  Pulse.pressure.sqrt'     0.2757  0.1556   1.77 0.0763  
## 
fit.univ2<-lrm(death ~ rcs(Diastolic.BP,4)+Age+rcs(Poverty.index.sqrt,3)+
                Sedimentation.rate.log+Serum.Albumin+Serum.Cholesterol.log+ rcs(Serum.Iron.sqrt,6)+ rcs(Serum.Magnesium,3)+
                Sex+White.blood.cells.log ,data = survey_mice, y=T, x=T)
fit.univ2
## Logistic Regression Model
##  
##  lrm(formula = death ~ rcs(Diastolic.BP, 4) + Age + rcs(Poverty.index.sqrt, 
##      3) + Sedimentation.rate.log + Serum.Albumin + Serum.Cholesterol.log + 
##      rcs(Serum.Iron.sqrt, 6) + rcs(Serum.Magnesium, 3) + Sex + 
##      White.blood.cells.log, data = survey_mice, x = T, y = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1641.47    R2       0.360    C       0.852    
##   0           5888    d.f.            18    g        1.969    Dxy     0.705    
##   1           1101    Pr(> chi2) <0.0001    gr       7.160    gamma   0.705    
##  max |deriv| 8e-09                          gp       0.187    tau-a   0.187    
##                                             Brier    0.100                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept               -0.0632  1.7779  -0.04 0.9716  
##  Diastolic.BP            -0.0290  0.0127  -2.29 0.0217  
##  Diastolic.BP'            0.0829  0.0426   1.95 0.0515  
##  Diastolic.BP''          -0.1826  0.1335  -1.37 0.1714  
##  Age                      0.0916  0.0037  24.80 <0.0001 
##  Poverty.index.sqrt      -0.0622  0.0156  -3.99 <0.0001 
##  Poverty.index.sqrt'      0.0396  0.0190   2.09 0.0370  
##  Sedimentation.rate.log   0.2299  0.0504   4.57 <0.0001 
##  Serum.Albumin           -0.3835  0.1261  -3.04 0.0024  
##  Serum.Cholesterol.log   -0.2256  0.1876  -1.20 0.2292  
##  Serum.Iron.sqrt          0.0723  0.1234   0.59 0.5583  
##  Serum.Iron.sqrt'        -2.8645  1.0242  -2.80 0.0052  
##  Serum.Iron.sqrt''       21.6862  6.9312   3.13 0.0018  
##  Serum.Iron.sqrt'''     -43.1089 14.3669  -3.00 0.0027  
##  Serum.Iron.sqrt''''     34.6205 13.2011   2.62 0.0087  
##  Serum.Magnesium         -1.7665  0.4986  -3.54 0.0004  
##  Serum.Magnesium'         1.0658  0.5606   1.90 0.0573  
##  Sex=female              -1.0137  0.0819 -12.38 <0.0001 
##  White.blood.cells.log    0.7392  0.1433   5.16 <0.0001 
## 

Forward algorithm

nothing <- glm(death~1, family=binomial, data = survey_mice)
forwards <- step(nothing, scope=list(lower = formula(nothing),upper = formula(fit.multi.glm)),
                 direction ="forward",trace=FALSE )


#refit with lrm to compare with the other models
fit.for<-lrm(death ~ Age + Sex + rcs(Poverty.index.sqrt, 4) + Sedimentation.rate.log + 
    White.blood.cells.log + rcs(Red.blood.cells.log, 3) + rcs(Systolic.BP,4) + rcs(BMI.log, 4) + 
      rcs(Serum.Magnesium, 3) + rcs(Serum.Iron.sqrt, 
    6) + Serum.Albumin + rcs(Serum.Protein, 3) + TIBC, data = survey_mice, x=T, y=T)
fit.for
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + Sex + rcs(Poverty.index.sqrt, 4) + 
##      Sedimentation.rate.log + White.blood.cells.log + rcs(Red.blood.cells.log, 
##      3) + rcs(Systolic.BP, 4) + rcs(BMI.log, 4) + rcs(Serum.Magnesium, 
##      3) + rcs(Serum.Iron.sqrt, 6) + Serum.Albumin + rcs(Serum.Protein, 
##      3) + TIBC, data = survey_mice, x = T, y = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1704.36    R2       0.372    C       0.857    
##   0           5888    d.f.            26    g        1.978    Dxy     0.714    
##   1           1101    Pr(> chi2) <0.0001    gr       7.228    gamma   0.714    
##  max |deriv| 4e-08                          gp       0.189    tau-a   0.190    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                9.3347  3.1291   2.98 0.0029  
##  Age                      0.0872  0.0040  22.04 <0.0001 
##  Sex=female              -1.0849  0.0881 -12.32 <0.0001 
##  Poverty.index.sqrt      -0.0213  0.0287  -0.74 0.4590  
##  Poverty.index.sqrt'     -0.1375  0.1371  -1.00 0.3156  
##  Poverty.index.sqrt''     0.5080  0.4061   1.25 0.2110  
##  Sedimentation.rate.log   0.2078  0.0556   3.74 0.0002  
##  White.blood.cells.log    0.7904  0.1463   5.40 <0.0001 
##  Red.blood.cells.log     -2.7339  0.7465  -3.66 0.0003  
##  Red.blood.cells.log'     3.6870  0.8722   4.23 <0.0001 
##  Systolic.BP             -0.0143  0.0115  -1.24 0.2142  
##  Systolic.BP'             0.0721  0.0451   1.60 0.1100  
##  Systolic.BP''           -0.1523  0.1041  -1.46 0.1432  
##  BMI.log                 -2.6905  0.8082  -3.33 0.0009  
##  BMI.log'                 4.1543  2.7439   1.51 0.1300  
##  BMI.log''               -7.8545  8.9974  -0.87 0.3827  
##  Serum.Magnesium         -1.5115  0.5041  -3.00 0.0027  
##  Serum.Magnesium'         0.8885  0.5665   1.57 0.1168  
##  Serum.Iron.sqrt          0.1824  0.1267   1.44 0.1498  
##  Serum.Iron.sqrt'        -3.3076  1.0397  -3.18 0.0015  
##  Serum.Iron.sqrt''       22.9178  7.0056   3.27 0.0011  
##  Serum.Iron.sqrt'''     -43.3547 14.4986  -2.99 0.0028  
##  Serum.Iron.sqrt''''     33.4097 13.3147   2.51 0.0121  
##  Serum.Albumin           -0.5633  0.1449  -3.89 0.0001  
##  Serum.Protein           -0.0216  0.1795  -0.12 0.9042  
##  Serum.Protein'           0.2821  0.1821   1.55 0.1213  
##  TIBC                     0.0014  0.0008   1.88 0.0603  
## 
summary(fit.for)
##              Effects              Response : death 
## 
##  Factor                 Low      High     Diff.    Effect    S.E.    
##  Age                     35.0000  66.0000 31.00000  2.703500 0.122680
##   Odds Ratio             35.0000  66.0000 31.00000 14.932000       NA
##  Poverty.index.sqrt      11.7050  19.3130  7.60850 -0.391760 0.108140
##   Odds Ratio             11.7050  19.3130  7.60850  0.675870       NA
##  Sedimentation.rate.log   1.9459   3.0910  1.14510  0.237960 0.063618
##   Odds Ratio              1.9459   3.0910  1.14510  1.268700       NA
##  White.blood.cells.log    1.7918   2.1518  0.36000  0.284550 0.052684
##   Odds Ratio              1.7918   2.1518  0.36000  1.329200       NA
##  Red.blood.cells.log      1.4816   1.6194  0.13778  0.008542 0.058766
##   Odds Ratio              1.4816   1.6194  0.13778  1.008600       NA
##  Systolic.BP            118.0000 150.0000 32.00000  0.256800 0.108930
##   Odds Ratio            118.0000 150.0000 32.00000  1.292800       NA
##  BMI.log                  3.0974   3.3433  0.24586 -0.159050 0.102440
##   Odds Ratio              3.0974   3.3433  0.24586  0.852950       NA
##  Serum.Magnesium          1.5900   1.7700  0.18000 -0.152120 0.045735
##   Odds Ratio              1.5900   1.7700  0.18000  0.858890       NA
##  Serum.Iron.sqrt          8.6603  11.0000  2.33970 -0.186020 0.103070
##   Odds Ratio              8.6603  11.0000  2.33970  0.830260       NA
##  Serum.Albumin            4.2000   4.6000  0.40000 -0.225340 0.057971
##   Odds Ratio              4.2000   4.6000  0.40000  0.798250       NA
##  Serum.Protein            6.8000   7.4000  0.60000  0.113990 0.054043
##   Odds Ratio              6.8000   7.4000  0.60000  1.120700       NA
##  TIBC                   323.0000 396.0000 73.00000  0.104660 0.055709
##   Odds Ratio            323.0000 396.0000 73.00000  1.110300       NA
##  Sex - male:female        2.0000   1.0000       NA  1.084900 0.088062
##   Odds Ratio              2.0000   1.0000       NA  2.959000       NA
##  Lower 0.95 Upper 0.95
##   2.4631000  2.944000 
##  11.7410000 18.991000 
##  -0.6037100 -0.179810 
##   0.5467800  0.835430 
##   0.1132700  0.362650 
##   1.1199000  1.437100 
##   0.1812900  0.387810 
##   1.1988000  1.473700 
##  -0.1066400  0.123720 
##   0.8988500  1.131700 
##   0.0432990  0.470310 
##   1.0443000  1.600500 
##  -0.3598400  0.041737 
##   0.6977900  1.042600 
##  -0.2417600 -0.062482 
##   0.7852500  0.939430 
##  -0.3880300  0.015996 
##   0.6783900  1.016100 
##  -0.3389600 -0.111720 
##   0.7125100  0.894300 
##   0.0080670  0.219910 
##   1.0081000  1.246000 
##  -0.0045309  0.213850 
##   0.9954800  1.238400 
##   0.9122700  1.257500 
##   2.4900000  3.516500

Backward and forward

bothways <- step(nothing, list(lower=formula(nothing), upper = formula(fit.multi.glm)), 
                 direction = "both", trace = 0)
#formula selected by the bidirectional search (stored as a formula, not refitted)
fit.both <- death ~ Age + Sex + Sedimentation.rate.log + rcs(Poverty.index.sqrt, 4) + 
             White.blood.cells.log + rcs(Systolic.BP, 4) + rcs(BMI.log, 4) + rcs(Serum.Magnesium, 3) + 
             rcs(Serum.Iron.sqrt, 6) + rcs(Red.blood.cells.log, 4) + Serum.Albumin + 
             Serum.Protein + TIBC
#same selection as the forward algorithm
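Since the bidirectional search ends on the same formula as the forward one, no separate fit is needed; the selected formula can also be read directly off the step object (a sketch using the bothways object above):

```r
# Inspect the formula chosen by the bidirectional stepwise search
formula(bothways)
```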

Models’ performance

######################## Performance statistics
s <- fit.multi.lrm$stats # performance of the estimated model 
round(s, digits=3) # C is the AUC; Dxy = 2*(C-0.5); both are discrimination measures
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1716.335     33.000      0.000      0.858      0.716 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.716      0.190      0.374      0.098      1.986      7.287      0.190
gamma.hat <-  (s['Model L.R.'] - s['d.f.'])/s['Model L.R.'] #shrinkage coefficient


s1 <- fit.multi.lrm_res$stats # performance of the estimated model 
round(s1, digits=3) # C is the AUC; Dxy = 2*(C-0.5); both are discrimination measures
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1680.786     21.000      0.000      0.855      0.709 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.710      0.188      0.368      0.098      1.953      7.047      0.188
gamma.hat1 <-  (s1['Model L.R.'] - s1['d.f.'])/s1['Model L.R.'] #shrinkage coefficient

#The full model has C = 0.858 vs C = 0.855 for the simpler model, and Dxy is slightly worse
#(from 0.716 to 0.709), but the shrinkage coefficient improves (from 0.981 to 0.988).
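The shrinkage coefficient computed above is the van Houwelingen–Le Cessie heuristic, obtained from the model likelihood-ratio statistic and its degrees of freedom; for the full model:

```latex
\hat{\gamma} \;=\; \frac{\chi^2_{LR} - \mathrm{df}}{\chi^2_{LR}}
\;=\; \frac{1716.335 - 33}{1716.335} \;\approx\; 0.981
```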

s2 <- fit.bw1$stats # performance of the estimated model 
round(s2, digits=3)
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1689.999     19.000      0.000      0.856      0.712 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.712      0.189      0.369      0.098      1.980      7.240      0.189
gamma.hat2 <-  (s2['Model L.R.'] - s2['d.f.'])/s2['Model L.R.'] #shrinkage coefficient

s3 <- fit.for$stats # performance of the estimated model 
round(s3, digits=3) 
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1704.365     26.000      0.000      0.857      0.714 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.714      0.190      0.372      0.098      1.978      7.228      0.189
gamma.hat3 <-  (s3['Model L.R.'] - s3['d.f.'])/s3['Model L.R.'] #shrinkage coefficient

s4 <- fit.univ$stats # performance of the estimated model 
round(s4, digits=3) 
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1661.589     23.000      0.000      0.854      0.708 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.708      0.188      0.364      0.099      1.964      7.125      0.188
gamma.hat4 <-  (s4['Model L.R.'] - s4['d.f.'])/s4['Model L.R.'] #shrinkage coefficient

s5 <- fit.univ2$stats # performance of the estimated model 
round(s5, digits=3) 
##        Obs  Max Deriv Model L.R.       d.f.          P          C        Dxy 
##   6989.000      0.000   1641.467     18.000      0.000      0.852      0.705 
##      Gamma      Tau-a         R2      Brier          g         gr         gp 
##      0.705      0.187      0.360      0.100      1.969      7.160      0.187
gamma.hat5 <-  (s5['Model L.R.'] - s5['d.f.'])/s5['Model L.R.'] #shrinkage coefficient

Question 5: Model Discrimination and Evaluation

# Let us now use the bootstrap to further evaluate calibration and discrimination of the candidate models:
## Full model
fmult <- update(fit.multi.lrm , x=TRUE , y=TRUE)
vmult <- validate (fmult, B=200)
#vmult # Dxy (0.7167 to 0.7049) (opt 0.0118) Slope (1-0.9629) (opt 0.0371)
cal.full  <- calibrate(fmult, B=200)


## Hand bw model
vhand <- validate (fit.multi.lrm_res, B=200)
#vhand             # Dxy 0.7093-0.7032 (opt 0.0061), slope 1 -0.9798 (opt 0.0202)
cal.hbw <- calibrate(fit.multi.lrm_res, B=200)



## Bw model
#fhand <- update(fit.multi.lrm_res, x=TRUE , y=TRUE)
vbw <- validate (fit.bw1, B=200)
#vbw  # Dxy(0.7136-0.7070) opt(0.0066) slope(1-0.9791) (opt 0.0209)
cal.bw <- calibrate(fit.bw1, B=200)



## forward model
ffor <- update(fit.for, x=TRUE , y=TRUE)
vfor <- validate (ffor, B=200)
#vfor  # Dxy(0.7141-0.7063) opt(0.0077) slope(1-0.9737) (opt 0.0263)
cal.for <- calibrate(ffor, B=200)


## univariate model
funiv <- update(fit.univ, x=TRUE , y=TRUE)
vuniv <- validate (funiv, B=200)
#vuniv  # Dxy( 0.7078-0.7000) opt(0.0078) slope(1-0.9753) (opt 0.0247)
cal.univ <- calibrate(funiv, B=200)


## univariate reduced model
funiv2 <- update(fit.univ2, x=TRUE , y=TRUE)
vuniv2 <- validate (funiv2, B=200)
#vuniv2  # Dxy(0.7048- 0.6999) opt(0.0049) slope(1-0.9838) (opt 0.0162)
cal.univ2 <- calibrate(funiv2, B=200)
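The validate() objects report the optimism-corrected Dxy; since C = (Dxy + 1)/2, the bias-corrected AUC of each candidate can be recovered directly from the validation matrices (a sketch reusing the objects above):

```r
# Bias-corrected C-index from the optimism-corrected Dxy (C = Dxy/2 + 0.5)
corrected.C <- sapply(list(full = vmult, hand_bw = vhand, bw = vbw,
                           forward = vfor, univ = vuniv, univ_red = vuniv2),
                      function(v) v["Dxy", "index.corrected"] / 2 + 0.5)
round(corrected.C, 3)
```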

Discrimination: AUC

# predicted probability by model

full.pred  <- predict(fit.multi.lrm, type="fitted") 
red.pred   <- predict(fit.multi.lrm_res, type="fitted") 
bw.pred   <- predict(fit.bw1, type="fitted") 
for.pred   <- predict(fit.for, type="fitted")
univ.pred<-predict(fit.univ, type="fitted")
univ.pred2<-predict(fit.univ2, type = "fitted")

dati.pp <- data.frame(survey_mice, full.pred, red.pred, bw.pred, for.pred, univ.pred, univ.pred2)
dati.pp$Race <- as.numeric(dati.pp$Race)
dati.pp$Sex <- as.numeric(dati.pp$Sex)
dati.pp$death <- as.numeric(dati.pp$death)
dati.pp$death <- dati.pp$death -1


###### compare FULL model and BACKWARD BY HAND model
dati.pp<-select(dati.pp, -c("systolic_test", "status"))
roc.comp <- roc(death ~ full.pred+red.pred, data = dati.pp)
summary.roc(roc.comp) 
## Call:
## roc(formula = death ~ full.pred + red.pred, data = dati.pp)
## 
## Error Method: delong
## 
## One-sided test (AUC > 0.5)
## Markers:
##                  ROC AUC  Std. Error  z value  Pr(>z)  lower.95  upper.95
##      full.pred   0.8580      0.0056    63.48       0    0.8470    0.8691
##      red.pred    0.8547      0.0057    61.87       0    0.8434    0.8659
## 
## Overall Test for Equality of Areas: 
##  Chi-square( 1 df ): 10.5167     P-value: 0.0012
# there is evidence that the two AUCs are statistically different (DeLong p = 0.0012)
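summary.roc() here comes from the course helper functions; the same paired DeLong comparison can be sketched with the pROC package (assuming pROC is installed):

```r
# Sketch: paired DeLong test on the two correlated ROC curves with pROC
library(pROC)
roc.full <- roc(dati.pp$death, dati.pp$full.pred)
roc.red  <- roc(dati.pp$death, dati.pp$red.pred)
roc.test(roc.full, roc.red, method = "delong", paired = TRUE)
```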

par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0  <- lroc2(model0, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="black")
model1 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
lroc1 <- lroc2(model1, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="red")
legend("bottomright", legend=c("Full Model","Hand Backward"), col=c("black","red"), lty=1, bty="n")

# Visually, however, the two curves almost overlap: the difference, though significant, is tiny

#### compare FULL model and BACKWARD BY ALGORITHM

roc.comp2 <- roc(death ~ full.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp2) # some evidence of a difference between the 2 curves (p = 0.011); the full model is slightly more discriminative
## Call:
## roc(formula = death ~ full.pred + bw.pred, data = dati.pp)
## 
## Error Method: delong
## 
## One-sided test (AUC > 0.5)
## Markers:
##             ROC AUC  Std. Error  z value  Pr(>z)  lower.95  upper.95
##  full.pred   0.8580      0.0056    63.48       0    0.8470    0.8691
##  bw.pred     0.8559      0.0057    62.51       0    0.8447    0.8670
## 
## Overall Test for Equality of Areas: 
##  Chi-square( 1 df ): 6.4885     P-value: 0.0109
# however the two curves are visually comparable,
# and the larger full model is expected to be less well calibrated
par(pty="s")
lroc0  <- lroc2(model0, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="black")
model3 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
lroc3 <- lroc2(model3, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="red")
legend("bottomright", legend=c("Full Model","Backward Model"), col=c("black","red"), lty=1, bty="n")

#### compare BACKWARD BY HAND and BACKWARD BY ALGORITHM
roc.comp3 <- roc(death ~ red.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp3) # weak evidence of a difference (p = 0.077); the algorithmic backward model has a slightly higher AUC
## Call:
## roc(formula = death ~ red.pred + bw.pred, data = dati.pp)
## 
## Error Method: delong
## 
## One-sided test (AUC > 0.5)
## Markers:
##            ROC AUC  Std. Error  z value  Pr(>z)  lower.95  upper.95
##  red.pred   0.8547      0.0057    61.87       0    0.8434    0.8659
##  bw.pred    0.8559      0.0057    62.51       0    0.8447    0.8670
## 
## Overall Test for Equality of Areas: 
##  Chi-square( 1 df ): 3.1388     P-value: 0.0765
par(pty="s")
lroc4  <- lroc2(model1, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="black")
model5 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
lroc5 <- lroc2(model5, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="red")
lroc5$auc
## [1] 0.8558896
legend("bottomright", legend=c("Bw hand","Bw algorithm"), col=c("black","red"), lty=1, bty="n")

# comparison of 3 models
par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0  <- lroc2(model0, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="black")
model2 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
lroc22 <- lroc2(model2, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="red")
model22 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
lroc222 <- lroc2(model22, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="blue")
legend("bottomright", legend=c("Full Model","Bw by hand","Bw by algorithm"), col=c("black","red","blue"), lty=1, bty="n")

# All three models are comparable in terms of discriminative ability, but the backward models
# are expected to be better calibrated since they are more parsimonious.

#### compare FORWARD  model and BACKWARD BY ALGORITHM

roc.comp4 <- roc(death ~ for.pred+bw.pred, data = dati.pp)
summary.roc(roc.comp4) # no strong evidence of a difference (p = 0.074)
## Call:
## roc(formula = death ~ for.pred + bw.pred, data = dati.pp)
## 
## Error Method: delong
## 
## One-sided test (AUC > 0.5)
## Markers:
##            ROC AUC  Std. Error  z value  Pr(>z)  lower.95  upper.95
##  for.pred   0.8570      0.0057    63.00       0    0.8459    0.8681
##  bw.pred    0.8559      0.0057    62.51       0    0.8447    0.8670
## 
## Overall Test for Equality of Areas: 
##  Chi-square( 1 df ): 3.18     P-value: 0.0745
# comparison of all six models
par(pty="s")
model0 <- glm(death ~ full.pred, data = dati.pp, family=binomial)
lroc0  <- lroc2(model0, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="black")
model2 <- glm(death ~ red.pred, data = dati.pp, family=binomial)
lroc22 <- lroc2(model2, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="red")
model22 <- glm(death ~ bw.pred, data = dati.pp, family=binomial)
lroc222 <- lroc2(model22, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="blue")
model23 <- glm(death ~ for.pred, data = dati.pp, family=binomial)
lroc23 <- lroc2(model23, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="green")
model5 <- glm(death ~ univ.pred, data = dati.pp, family=binomial)
lroc5 <- lroc2(model5, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="purple")
model6 <- glm(death ~ univ.pred2, data = dati.pp, family=binomial)
lroc6 <- lroc2(model6, add=T, lty.choice=1, lwd=1, pch='.', cex=0.1, line.col="orange")
legend("bottomright",
       legend=c("Full Model","Bw by hand","Bw by algorithm","Forward","Univ","Univ reduced"),
       col=c("black","red","blue","green","purple","orange"), lty=1, bty="n")

# All the models are comparable in terms of discriminative ability, but the backward models
# are expected to be better calibrated since they are more parsimonious.

# Backward by hand and backward by algorithm seem to be the best models: their discriminative
# ability is very close to that of the full model, they are better calibrated, and the
# algorithmic backward model also has a slightly smaller AIC.

Summary: Multivariable Model chosen

#### Calibration
par(mfrow = c(2,3))
plot(cal.full, main = "Full model") # MAE 0.004
## 
## n=6989   Mean absolute error=0.004   Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.009
plot(cal.hbw,main = "Bw by hand") # MAE 0.004
## 
## n=6989   Mean absolute error=0.004   Mean squared error=2e-05
## 0.9 Quantile of absolute error=0.008
plot(cal.bw, main = "Bw by algorithm") # MAE = 0.002
## 
## n=6989   Mean absolute error=0.002   Mean squared error=1e-05
## 0.9 Quantile of absolute error=0.006
plot(cal.for, main = "Forward by algorithm") # MAE = 0.003
## 
## n=6989   Mean absolute error=0.003   Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.008
plot(cal.univ, main = "Univariate Filtering") #MAE = 0.003
## 
## n=6989   Mean absolute error=0.003   Mean squared error=4e-05
## 0.9 Quantile of absolute error=0.007
plot(cal.univ2, main = "Univariate Filtering reduced") #MAE = 0.003

## 
## n=6989   Mean absolute error=0.003   Mean squared error=3e-05
## 0.9 Quantile of absolute error=0.006
col <- c("full", "hand_bw", "bw", "forward", "univ", "univ_red") # model labels (note: masks base::col)

#### Shrinkage
gamma.t <- c(gamma.hat, gamma.hat1, gamma.hat2, gamma.hat3, gamma.hat4, gamma.hat5) 
# univ_red has the highest shrinkage coefficient, indicating that this model will validate 
# on new data only about 1% worse than on this dataset (simpler than the full model, less overfitting).

m <- matrix(gamma.t, nrow = 1, byrow = TRUE, 
            dimnames = list("Shrink", col))
m
##            full   hand_bw        bw   forward      univ  univ_red
## Shrink 0.980773 0.9875058 0.9887574 0.9847451 0.9861578 0.9890342
#### AIC
AIC.models <- c(AIC(fit.multi.lrm), AIC(fit.multi.lrm_res), AIC(fit.bw1), AIC(fit.for), AIC(fit.univ), AIC(fit.univ2))
 

mAIC <- matrix(AIC.models, nrow = 1, byrow = TRUE, 
            dimnames = list("AIC", col))
mAIC # forward has the lowest AIC
##         full  hand_bw       bw  forward     univ univ_red
## AIC 4439.877 4451.426 4438.213 4437.847 4474.623 4484.745
#### AUC
area <- c(lroc0$auc, lroc22$auc, lroc222$auc, lroc23$auc, lroc5$auc, lroc6$auc) 

mAuc <- matrix(area, nrow = 1, byrow = TRUE, 
            dimnames = list("AUC", col))
mAuc
##          full   hand_bw        bw   forward      univ  univ_red
## AUC 0.8580496 0.8546574 0.8558896 0.8570258 0.8539021 0.8523997
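The three criteria can also be stacked into a single comparison table (a sketch reusing the vectors built above):

```r
# Stack shrinkage, AIC and apparent AUC into one comparison table
comparison <- rbind(Shrink = gamma.t, AIC = AIC.models, AUC = area)
colnames(comparison) <- col
round(comparison, 3)
```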

Considering all the results returned by the different statistical procedures, we can conclude that all the models are very close in terms of discrimination and calibration. The best trade-off between these two measures is given by fit.bw1, the model obtained with fastbw() after removing the non-linear effect of Poverty.index.sqrt (as suggested by the anova table and by the p-values of the coefficients).

fit.bw1
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 
##      3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium, 
##      3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) + 
##      rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice, 
##      x = TRUE, y = TRUE)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1690.00    R2       0.369    C       0.856    
##   0           5888    d.f.            19    g        1.980    Dxy     0.712    
##   1           1101    Pr(> chi2) <0.0001    gr       7.240    gamma   0.712    
##  max |deriv| 1e-08                          gp       0.189    tau-a   0.189    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                5.4151  2.1170   2.56 0.0105  
##  Age                      0.0862  0.0038  22.58 <0.0001 
##  Poverty.index.sqrt      -0.0295  0.0065  -4.55 <0.0001 
##  Red.blood.cells.log     -2.8444  0.7427  -3.83 0.0001  
##  Red.blood.cells.log'     3.8544  0.8660   4.45 <0.0001 
##  Sedimentation.rate.log   0.2081  0.0553   3.76 0.0002  
##  Serum.Albumin           -0.5789  0.1418  -4.08 <0.0001 
##  Serum.Magnesium         -1.6249  0.5016  -3.24 0.0012  
##  Serum.Magnesium'         1.0067  0.5631   1.79 0.0738  
##  Systolic.BP              0.0077  0.0016   4.70 <0.0001 
##  Serum.Protein            0.2456  0.0878   2.80 0.0052  
##  Sex=female              -1.0500  0.0864 -12.15 <0.0001 
##  BMI.log                 -2.4128  0.4862  -4.96 <0.0001 
##  BMI.log'                 2.2768  0.5638   4.04 <0.0001 
##  Serum.Iron.sqrt          0.1596  0.1261   1.27 0.2054  
##  Serum.Iron.sqrt'        -3.2556  1.0373  -3.14 0.0017  
##  Serum.Iron.sqrt''       23.0333  6.9910   3.29 0.0010  
##  Serum.Iron.sqrt'''     -44.1589 14.4662  -3.05 0.0023  
##  Serum.Iron.sqrt''''     34.4816 13.2855   2.60 0.0094  
##  White.blood.cells.log    0.7798  0.1460   5.34 <0.0001 
## 
confint.default(fit.bw1)
##                                2.5 %       97.5 %
## Intercept                1.265788898   9.56444370
## Age                      0.078750845   0.09372112
## Poverty.index.sqrt      -0.042235879  -0.01679788
## Red.blood.cells.log     -4.299935352  -1.38879861
## Red.blood.cells.log'     2.156936446   5.55176780
## Sedimentation.rate.log   0.099736610   0.31652637
## Serum.Albumin           -0.856930456  -0.30092611
## Serum.Magnesium         -2.607915772  -0.64187393
## Serum.Magnesium'        -0.096935532   2.11041824
## Systolic.BP              0.004483579   0.01089159
## Serum.Protein            0.073505846   0.41768159
## Sex=female              -1.219442236  -0.88058923
## BMI.log                 -3.365788052  -1.45975345
## BMI.log'                 1.171728220   3.38180726
## Serum.Iron.sqrt         -0.087428400   0.40672010
## Serum.Iron.sqrt'        -5.288672971  -1.22248731
## Serum.Iron.sqrt''        9.331134522  36.73551253
## Serum.Iron.sqrt'''     -72.512171187 -15.80554885
## Serum.Iron.sqrt''''      8.442552510  60.52072836
## White.blood.cells.log    0.493738111   1.06585391

We have to keep in mind that the coefficients \(\beta\) in logistic regression represent the change in the log odds when the predictor increases by one unit. To interpret the results it is easier to reason on the Odds Ratio (OR) scale. The odds ratio is a measure of association between the predictor and the outcome:

  • OR > 1 indicates an increase in the probability of the outcome
  • OR = 1 indicates no effect of the predictor on the probability of the outcome
  • OR < 1 means that the predictor has a protective effect, i.e. it decreases the probability of the outcome.

If the predictor is continuous and we exponentiate its coefficient estimated by the model, we obtain the estimated OR for an increase of one unit of the predictor, keeping all the other covariates constant, while summary(model) reports the effect over an interquartile range. If the predictor is binary or categorical we obtain the estimated OR of a level with respect to the chosen baseline level.
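The same exponentiation can be applied to the whole coefficient vector together with the Wald limits from confint.default() shown above (a sketch; fit.bw1 is the model fitted earlier):

```r
# Odds ratios per one-unit increase, with 95% Wald confidence limits
or.tab <- exp(cbind(OR = fit.bw1$coefficients, confint.default(fit.bw1)))
round(or.tab, 3)
```

Note that for the restricted-cubic-spline terms (the primed coefficients) the single per-unit ORs are not directly interpretable; the interquartile-range summaries below are the right tool for those.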

#Age
exp(fit.bw1$coefficients)[2]
##      Age 
## 1.090064
### Effect on odds ratio scale
# Interquartile-range odds ratios for continuous predictors and simple odds
# ratios for categorical predictors. Numbers at left are upper quartile vs lower quartile or
# current group vs reference group. The bars represent confidence limits.
# The intervals are drawn on the log odds ratio scale and labeled on the odds ratio
# scale. Ranges are on the original scale.
summary(fit.bw1)
##              Effects              Response : death 
## 
##  Factor                 Low      High     Diff.    Effect    S.E.    
##  Age                     35.0000  66.0000 31.00000  2.673300 0.118390
##   Odds Ratio             35.0000  66.0000 31.00000 14.488000       NA
##  Poverty.index.sqrt      11.7050  19.3130  7.60850 -0.224580 0.049375
##   Odds Ratio             11.7050  19.3130  7.60850  0.798850       NA
##  Red.blood.cells.log      1.4816   1.6194  0.13778  0.010799 0.058147
##   Odds Ratio              1.4816   1.6194  0.13778  1.010900       NA
##  Sedimentation.rate.log   1.9459   3.0910  1.14510  0.238340 0.063331
##   Odds Ratio              1.9459   3.0910  1.14510  1.269100       NA
##  Serum.Albumin            4.2000   4.6000  0.40000 -0.231570 0.056736
##   Odds Ratio              4.2000   4.6000  0.40000  0.793290       NA
##  Serum.Magnesium          1.5900   1.7700  0.18000 -0.156570 0.045607
##   Odds Ratio              1.5900   1.7700  0.18000  0.855070       NA
##  Systolic.BP            118.0000 150.0000 32.00000  0.246000 0.052311
##   Odds Ratio            118.0000 150.0000 32.00000  1.278900       NA
##  Serum.Protein            6.8000   7.4000  0.60000  0.147360 0.052681
##   Odds Ratio              6.8000   7.4000  0.60000  1.158800       NA
##  BMI.log                  3.0974   3.3433  0.24586 -0.221970 0.055039
##   Odds Ratio              3.0974   3.3433  0.24586  0.800940       NA
##  Serum.Iron.sqrt          8.6603  11.0000  2.33970 -0.180050 0.102540
##   Odds Ratio              8.6603  11.0000  2.33970  0.835220       NA
##  White.blood.cells.log    1.7918   2.1518  0.36000  0.280730 0.052543
##   Odds Ratio              1.7918   2.1518  0.36000  1.324100       NA
##  Sex - male:female        2.0000   1.0000       NA  1.050000 0.086444
##   Odds Ratio              2.0000   1.0000       NA  2.857700       NA
##  Lower 0.95 Upper 0.95
##   2.441300   2.905400 
##  11.488000  18.272000 
##  -0.321350  -0.127810 
##   0.725170   0.880020 
##  -0.103170   0.124770 
##   0.901980   1.132900 
##   0.114210   0.362460 
##   1.121000   1.436900 
##  -0.342770  -0.120370 
##   0.709800   0.886590 
##  -0.245960  -0.067182 
##   0.781950   0.935030 
##   0.143470   0.348530 
##   1.154300   1.417000 
##   0.044104   0.250610 
##   1.045100   1.284800 
##  -0.329850  -0.114100 
##   0.719030   0.892170 
##  -0.381030   0.020924 
##   0.683160   1.021100 
##   0.177750   0.383710 
##   1.194500   1.467700 
##   0.880590   1.219400 
##   2.412300   3.385300
plot(summary(fit.bw1),log=TRUE)

From this model, Poverty.index.sqrt, BMI.log, Serum.Albumin and Serum.Magnesium have a protective effect on the risk of death. For Poverty.index.sqrt, for example, this means that being richer decreases the risk of death.

We have to be aware that for the transformed variables we must back-transform the quartiles, applying exp() or squaring if we used log or sqrt respectively. For example, patients with a Poverty index of \((19.3)^2 \approx 373\) have about 20% lower odds of death than patients with an index of \((11.7)^2 \approx 137\) (interquartile OR ≈ 0.80).

On the contrary, the probability of death is increased by the predictors Age, Sedimentation.rate.log, Systolic.BP, Serum.Protein and White.blood.cells.log: an increase in these predictors increases the risk of death. Furthermore, male subjects have an estimated 2.86 times higher odds of death than female subjects.
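As a numeric check of this back-transformation, the interquartile effect of Poverty.index.sqrt can be reproduced by hand (quartiles and effect taken from summary(fit.bw1) above):

```r
# Quartiles of Poverty.index.sqrt, from summary(fit.bw1)
low <- 11.705; high <- 19.313      # on the sqrt scale
round(c(low^2, high^2))            # ~137 and ~373 on the original Poverty index scale
exp(-0.22458)                      # interquartile odds ratio ~0.80
```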

Question 6: Present the results

Represent for the clinical audience the results from the estimated model

survey_mice$Sexf <- as.factor(survey_mice$Sex)
levels(survey_mice$Sexf)   <- c("male","female")
label(survey_mice$Sexf)    <- "Sex"

options(datadist='dd')
dd <-datadist(survey_mice)
formula(fit.bw1)
## death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 3) + 
##     Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium, 
##     3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) + 
##     rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log
fredfin <- lrm(death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 3) + Sedimentation.rate.log + 
    Serum.Albumin + rcs(Serum.Magnesium, 3) + Systolic.BP + Serum.Protein + 
    Sex + rcs(BMI.log, 3) + rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log ,data = survey_mice)
fredfin
## Logistic Regression Model
##  
##  lrm(formula = death ~ Age + Poverty.index.sqrt + rcs(Red.blood.cells.log, 
##      3) + Sedimentation.rate.log + Serum.Albumin + rcs(Serum.Magnesium, 
##      3) + Systolic.BP + Serum.Protein + Sex + rcs(BMI.log, 3) + 
##      rcs(Serum.Iron.sqrt, 6) + White.blood.cells.log, data = survey_mice)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs          6989    LR chi2    1690.00    R2       0.369    C       0.856    
##   0           5888    d.f.            19    g        1.980    Dxy     0.712    
##   1           1101    Pr(> chi2) <0.0001    gr       7.240    gamma   0.712    
##  max |deriv| 1e-08                          gp       0.189    tau-a   0.189    
##                                             Brier    0.098                     
##  
##                         Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept                5.4151  2.1170   2.56 0.0105  
##  Age                      0.0862  0.0038  22.58 <0.0001 
##  Poverty.index.sqrt      -0.0295  0.0065  -4.55 <0.0001 
##  Red.blood.cells.log     -2.8444  0.7427  -3.83 0.0001  
##  Red.blood.cells.log'     3.8544  0.8660   4.45 <0.0001 
##  Sedimentation.rate.log   0.2081  0.0553   3.76 0.0002  
##  Serum.Albumin           -0.5789  0.1418  -4.08 <0.0001 
##  Serum.Magnesium         -1.6249  0.5016  -3.24 0.0012  
##  Serum.Magnesium'         1.0067  0.5631   1.79 0.0738  
##  Systolic.BP              0.0077  0.0016   4.70 <0.0001 
##  Serum.Protein            0.2456  0.0878   2.80 0.0052  
##  Sex=female              -1.0500  0.0864 -12.15 <0.0001 
##  BMI.log                 -2.4128  0.4862  -4.96 <0.0001 
##  BMI.log'                 2.2768  0.5638   4.04 <0.0001 
##  Serum.Iron.sqrt          0.1596  0.1261   1.27 0.2054  
##  Serum.Iron.sqrt'        -3.2556  1.0373  -3.14 0.0017  
##  Serum.Iron.sqrt''       23.0333  6.9910   3.29 0.0010  
##  Serum.Iron.sqrt'''     -44.1589 14.4662  -3.05 0.0023  
##  Serum.Iron.sqrt''''     34.4816 13.2855   2.60 0.0094  
##  White.blood.cells.log    0.7798  0.1460   5.34 <0.0001 
## 
nom <-  nomogram (fredfin,
                  fun=plogis , funlabel ="Probability ",
                  lp=TRUE, 
                  fun.at =c(.01,.1,.25,.5,.75,.95))


plot(
  nom,
  cex.axis = 0.75 ,
  cex.var = 0.75 ,
  #col.grid = gray(c(0.8, 0.95))
  )

nom
## Points per unit of linear predictor: 23.19217 
## Linear predictor units per point   : 0.04311799 
## 
## 
##  Age Points
##  25    0   
##  30   10   
##  35   20   
##  40   30   
##  45   40   
##  50   50   
##  55   60   
##  60   70   
##  65   80   
##  70   90   
##  75  100   
## 
## 
##  Poverty.index.sqrt Points
##   0                 24    
##   5                 21    
##  10                 17    
##  15                 14    
##  20                 10    
##  25                  7    
##  30                  3    
##  35                  0    
## 
## 
##  Red.blood.cells.log Points
##  0.9                 39    
##  1.0                 32    
##  1.1                 26    
##  1.2                 19    
##  1.3                 12    
##  1.4                  6    
##  1.5                  0    
##  1.6                  0    
##  1.7                  6    
##  1.8                 13    
##  1.9                 20    
##  2.0                 27    
##  2.1                 34    
##  2.2                 40    
## 
## 
##  Sedimentation.rate.log Points
##  0.0                     0    
##  0.5                     2    
##  1.0                     5    
##  1.5                     7    
##  2.0                    10    
##  2.5                    12    
##  3.0                    14    
##  3.5                    17    
##  4.0                    19    
##  4.5                    22    
## 
## 
##  Serum.Albumin Points
##  2.5           47    
##  3.0           40    
##  3.5           34    
##  4.0           27    
##  4.5           20    
##  5.0           13    
##  5.5            7    
##  6.0            0    
## 
## 
##  Serum.Magnesium Points
##  0.8             36    
##  1.0             29    
##  1.2             21    
##  1.4             14    
##  1.6              6    
##  1.8              3    
##  2.0              2    
##  2.2              2    
##  2.4              1    
##  2.6              1    
##  2.8              0    
## 
## 
##  Systolic.BP Points
##   80          0    
##  100          4    
##  120          7    
##  140         11    
##  160         14    
##  180         18    
##  200         21    
##  220         25    
##  240         29    
##  260         32    
## 
## 
##  Serum.Protein Points
##   4             0    
##   5             6    
##   6            11    
##   7            17    
##   8            23    
##   9            28    
##  10            34    
##  11            40    
##  12            46    
## 
## 
##  Sex    Points
##  male   24    
##  female  0    
## 
## 
##  BMI.log Points
##  2.4     43    
##  2.6     32    
##  2.8     21    
##  3.0     10    
##  3.2      0    
##  3.4      0    
##  3.6      3    
##  3.8      7    
##  4.0     10    
##  4.2     14    
##  4.4     17    
## 
## 
##  Serum.Iron.sqrt Points
##   4               0    
##   6               7    
##   8              13    
##  10               4    
##  12               3    
##  14               8    
##  16              14    
##  18              20    
##  20              25    
## 
## 
##  White.blood.cells.log Points
##  0.5                    0    
##  1.0                    9    
##  1.5                   18    
##  2.0                   27    
##  2.5                   36    
##  3.0                   45    
##  3.5                   54    
##  4.0                   63    
##  4.5                   72    
## 
## 
##  Total Points Probability 
##           128         0.01
##           184         0.10
##           209         0.25
##           235         0.50
##           260         0.75
##           303         0.95
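As a cross-check of how the nomogram is read, the probability for a given patient can also be obtained directly from the model; the covariate values below are purely hypothetical:

```r
# Predicted 10-year death probability for a hypothetical patient
new.pat <- data.frame(
  Age = 60, Poverty.index.sqrt = sqrt(150),
  Red.blood.cells.log = log(4.6), Sedimentation.rate.log = log(15),
  Serum.Albumin = 4.4, Serum.Magnesium = 1.7, Systolic.BP = 140,
  Serum.Protein = 7.2, Sex = "female", BMI.log = log(27),
  Serum.Iron.sqrt = sqrt(100), White.blood.cells.log = log(7))
predict(fredfin, new.pat, type = "fitted")
```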

Question 7: Machine Learning model XGBoost

Why? XGBoost is an algorithm that has recently been dominating applied machine learning competitions, also in biomedical research. It has a wide range of applications: it can be used to solve regression, classification, ranking, and user-defined prediction problems. XGBoost stands for eXtreme Gradient Boosting and is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. Gradient boosting is an approach where new models are trained to predict the residuals (i.e. the errors) of the prior models: iteratively, a new tree is built to predict the residual errors of the current ensemble, and its predictions are then added to it (boosting). When adding new models, the loss is minimized using a gradient descent algorithm.
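The boosting idea can be illustrated with a toy sketch for squared-error loss, where the negative gradient is simply the residual (illustrative only; xgboost adds regularization, second-order gradient information and many optimizations):

```r
library(rpart)                     # small regression trees as base learners
set.seed(42)
df <- data.frame(x = runif(300))
df$y <- sin(2 * pi * df$x) + rnorm(300, sd = 0.2)

pred <- rep(mean(df$y), nrow(df))  # start from a constant prediction
eta  <- 0.3                        # learning rate (shrinkage)
for (m in 1:50) {
  df$r <- df$y - pred              # residuals = negative gradient of squared error
  tree <- rpart(r ~ x, data = df, control = rpart.control(maxdepth = 2, cp = 0))
  pred <- pred + eta * predict(tree, df)   # boosting: add the new tree's contribution
}
mean((df$y - pred)^2)              # training MSE shrinks as trees are added
```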

survey_mice$death <- as.factor(survey_mice$death)

#split data in train and test set
# stratified split: the outcome distribution is kept equal in train and test
# (calling set.seed() beforehand would make the split reproducible)

trainIndex <- createDataPartition(survey_mice$death, p = .8, 
                                  list = FALSE, 
                                  times = 1)
surveyTrain <- survey_mice[ trainIndex,]
surveyTest  <- survey_mice[-trainIndex,]

options(na.action='na.pass')

#implement one hot encoding: all the variables must be numeric
new.tr <- model.matrix(~., data = surveyTrain %>% dplyr::select(-c(death, time_death, status))) 

options(na.action='na.pass')
new.ts <- model.matrix(~., data = surveyTest %>% dplyr::select(-c(death, time_death, status)))

#convert factor response to numeric
train.labels = as.numeric(surveyTrain$death)-1
test.labels = as.numeric(surveyTest$death)-1

dtrain <- xgb.DMatrix(data = new.tr, label = train.labels) 
dtest <- xgb.DMatrix(data = new.ts, label = test.labels)

#tuning parameter
# gbtree = ensemble of decision trees
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)


#5-fold CV on AUC; note maximize = T, since AUC must be maximized
#(with maximize = F early stopping halts at the first round)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T,
                 stratified = T, print_every_n = 10, early_stopping_rounds = 20, maximize = T, metrics = "auc", prediction = T)
## [1]  train-auc:0.863656+0.004876 test-auc:0.809192+0.012647 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 20 rounds.
## 
## [11] train-auc:0.954022+0.002061 test-auc:0.832408+0.011991 
## [21] train-auc:0.977327+0.002507 test-auc:0.835268+0.010011 
## Stopping. Best iteration:
## [1]  train-auc:0.863656+0.004876 test-auc:0.809192+0.012647
xgbcv_er <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T,
                 stratified = T,print_every_n = 10, early_stopping_rounds = 20, maximize = F, metrics = "error", prediction = T)
## [1]  train-error:0.121692+0.002680   test-error:0.166483+0.011247 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
## 
## [11] train-error:0.079399+0.002006   test-error:0.146814+0.007907 
## [21] train-error:0.056912+0.002411   test-error:0.148244+0.010093 
## [31] train-error:0.040549+0.002240   test-error:0.147528+0.012315 
## Stopping. Best iteration:
## [12] train-error:0.077074+0.002970   test-error:0.145384+0.007333
xgbcv_er
## ##### xgb.cv 5-folds
##  iter train_error_mean train_error_std test_error_mean test_error_std
##     1        0.1216916    0.0026802775       0.1664832    0.011247160
##     2        0.1102466    0.0014171098       0.1571850    0.009041203
##     3        0.1045686    0.0041199426       0.1552196    0.005343685
##     4        0.0997850    0.0028915333       0.1511062    0.006611158
##     5        0.0944650    0.0026314315       0.1512854    0.006275188
##     6        0.0918722    0.0023509253       0.1520002    0.008041120
##     7        0.0895476    0.0020719889       0.1507476    0.008527296
##     8        0.0867758    0.0008652018       0.1482436    0.009338382
##     9        0.0848534    0.0009458314       0.1489582    0.010910121
##    10        0.0823498    0.0019451724       0.1482446    0.007013073
##    11        0.0793992    0.0020064718       0.1468138    0.007906507
##    12        0.0770744    0.0029704292       0.1453836    0.007332844
##    13        0.0756886    0.0036086057       0.1468138    0.008772210
##    14        0.0730058    0.0033395608       0.1468140    0.009706942
##    15        0.0700558    0.0036734958       0.1471718    0.008832453
##    16        0.0678204    0.0027948312       0.1457406    0.010038517
##    17        0.0649144    0.0026219399       0.1464562    0.009232826
##    18        0.0625002    0.0018677320       0.1469922    0.011012184
##    19        0.0604884    0.0021593235       0.1477072    0.012866228
##    20        0.0591024    0.0016807219       0.1484228    0.011856811
##    21        0.0569120    0.0024113226       0.1482444    0.010092984
##    22        0.0544978    0.0029671626       0.1469924    0.012163061
##    23        0.0523518    0.0025344273       0.1478862    0.011497970
##    24        0.0491326    0.0022407130       0.1487802    0.011296364
##    25        0.0483276    0.0023441126       0.1482432    0.011909274
##    26        0.0468080    0.0029321842       0.1484220    0.012893784
##    27        0.0451984    0.0031890937       0.1478852    0.013350000
##    28        0.0437680    0.0022796510       0.1503892    0.011952926
##    29        0.0418456    0.0015942728       0.1487796    0.012279878
##    30        0.0414428    0.0025213293       0.1482434    0.012279533
##    31        0.0405488    0.0022401940       0.1475280    0.012314519
##    32        0.0396994    0.0020151109       0.1473492    0.012616106
##  iter train_error_mean train_error_std test_error_mean test_error_std
## Best iteration:
##  iter train_error_mean train_error_std test_error_mean test_error_std
##    12        0.0770744     0.002970429       0.1453836    0.007332844
# model training; the last element of the watchlist governs early stopping,
# so the validation set is put last to stop on val-error rather than train-error
xgb1 <- xgb.train (params = params, data = dtrain, nrounds = 79, watchlist = list(train=dtrain,val=dtest),
                     print_every_n = 10, early_stopping_round = 10, maximize = F , eval_metric = "error")
## [1]  val-error:0.153185  train-error:0.126252 
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
## 
## [11] val-error:0.141732  train-error:0.085837 
## [21] val-error:0.150322  train-error:0.067597 
## [31] val-error:0.156764  train-error:0.044170 
## [41] val-error:0.152470  train-error:0.030937 
## [51] val-error:0.158196  train-error:0.017525 
## [61] val-error:0.161775  train-error:0.010193 
## [71] val-error:0.156049  train-error:0.006080 
## [79] val-error:0.152470  train-error:0.003398
#model prediction
xgbpred <- predict (xgb1,dtest)
xgbpred <- ifelse (xgbpred > 0.5,1,0)
xgbpredf <- as.factor(xgbpred)
test.labelsf <- as.factor(test.labels) #the reference must be the true labels, not the predictions
confusionMatrix(xgbpredf, test.labelsf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1278    0
##          1    0  119
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9974, 1)
##     No Information Rate : 0.9148     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.9148     
##          Detection Rate : 0.9148     
##    Detection Prevalence : 0.9148     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 0          
## 
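A threshold-free measure on the held-out set is the AUC, which can be computed from the raw predicted probabilities before they are dichotomized at 0.5 (a sketch; assumes the pROC package is installed):

```r
# Test-set AUC from the raw predicted probabilities
library(pROC)
roc.obj <- roc(test.labels, predict(xgb1, dtest))
auc(roc.obj)
```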
#Importance of variables
mat <- xgb.importance (feature_names = colnames(new.tr),model = xgb1)
xgb.ggplot.importance (importance_matrix = mat[1:19]) 

Evaluating the model on the held-out test set must be done against the true labels (test.labels): the perfect confusion matrix reported above resulted from comparing the predictions with themselves. The 5-fold cross-validation gives a more honest assessment of performance: the test AUC is ~0.83, comparable to the Logistic Regression model implemented before.

The list of most important variables partially reflects the ones used in the Logistic model; both identify Age as the main contributor.

We have to consider that ML algorithms work best on large and complex datasets, and this can be a limitation when we deal with medical data.

Comments and further extensions

  • The log and sqrt transformations of some variables, applied to make their distributions closer to normal, lead to smaller SEs

  • The Logistic model performs quite well, with an AUC > 80%

  • We could try to stratify the data by Sex to see if there are any differences between groups

  • We have highly correlated variables; this can inflate the standard errors of the coefficient estimates even when the overall goodness of fit is unaffected. Dropping the most correlated variables is the easiest solution, but not the most effective. There could also be important interactions, and thus possible confounding variables not considered in this model. A clinician's opinion could be useful to identify these interactions
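One quick diagnostic for the collinearity concern is the variance inflation factor, which rms provides for its model fits (a sketch against the final model fitted above):

```r
# VIFs well above 5-10 would flag predictors whose SEs are inflated by collinearity
round(rms::vif(fit.bw1), 2)
```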